Diffstat (limited to 'doc/idutils.texi')
-rw-r--r-- doc/idutils.texi 1322
1 files changed, 1322 insertions, 0 deletions
diff --git a/doc/idutils.texi b/doc/idutils.texi
new file mode 100644
index 0000000..7c5cd34
--- /dev/null
+++ b/doc/idutils.texi
@@ -0,0 +1,1322 @@
+\input texinfo
+@comment %**start of header
+@setfilename idutils.info
+@settitle ID database utilities
+@comment %**end of header
+
+@include version.texi
+
+@c Define new indices for file names and options.
+@defcodeindex fl
+@defcodeindex op
+
+@c Put everything in one index (arbitrarily chosen to be the concept index).
+@syncodeindex fl cp
+@syncodeindex fn cp
+@syncodeindex ky cp
+@syncodeindex op cp
+@syncodeindex pg cp
+@syncodeindex vr cp
+
+@ifinfo
+@format
+START-INFO-DIR-ENTRY
+* ID database: (idutils). Identifier database utilities.
+* mkid: (idutils)mkid invocation. Creating an ID database.
+* lid: (idutils)lid invocation. Matching words and patterns.
+* fid: (idutils)fid invocation. Listing a file's tokens.
+* fnid: (idutils)fnid invocation. Looking up file names.
+* xtokid: (idutils)xtokid invocation. Testing mkid scanners.
+END-INFO-DIR-ENTRY
+@end format
+@end ifinfo
+
+@ifinfo
+This file documents the @file{idutils} database utilities.
+
+Copyright (C) 1996, 1999, 2000 Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+@ignore
+Permission is granted to process this file through TeX and print the
+results, provided the printed document carries copying permission
+notice identical to this one except for the removal of this paragraph
+(this paragraph not being relevant to the printed manual).
+
+@end ignore
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided that the entire
+resulting derived work is distributed under the terms of a permission
+notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions,
+except that this permission notice may be stated in a translation.
+@end ifinfo
+
+@titlepage
+@title ID database utilities
+@subtitle Programs for simple, fast, high-capacity cross-referencing
+@subtitle for version @value{VERSION}
+@author Greg McGary
+@author Tom Horsley
+@end titlepage
+
+@ifinfo
+@c ************* gkm *********************************************************
+@node Top
+@top ID utilities
+
+This manual documents version @value{VERSION} of the ID utilities.
+
+@menu
+* Introduction:: Overview of the tools with tutorial.
+* Quick start:: Quick start procedure.
+* Common options:: Common command-line options.
+* mkid invocation:: Creating an ID database.
+* lid invocation:: Querying an ID database by token.
+* fid invocation:: Listing a file's tokens.
+* fnid invocation:: Looking up file names.
+* xtokid invocation:: Testing language scanners.
+* Past and Future:: History and future directions.
+* Index:: General index.
+@end menu
+@end ifinfo
+
+@c ************* gkm *********************************************************
+@node Introduction
+@chapter Introduction
+
+@cindex overview
+@cindex introduction
+@cindex ID database, definition of
+
+An @dfn{ID database} is a binary file containing a list of file names, a
+list of tokens, and a sparse matrix indicating which tokens
+appear in which files.
+
+With this database and some tools to query it (described in this
+manual), many text-searching tasks become simpler and faster. For
+example, you can list all files that reference a particular
+@code{#include} file throughout a huge source hierarchy, search for all
+the memos containing references to a project, or automatically invoke an
+editor on all files containing references to some function or variable.
+Anyone with a large software project to maintain, or a large set of text
+files to organize, can benefit from the ID utilities.
+
+Although the name `ID' is short for `identifier', the ID utilities
+handle more than just identifiers; they also treat other kinds of
+tokens, most notably numeric constants, and the contents of certain
+character strings. Thus, this manual will use the word @dfn{token} as a
+term that is inclusive of identifiers, numbers and strings.
+
+There are several programs in the ID utilities family:
+
+@table @file
+
+@item mkid
+scans files for tokens and builds the ID database file.
+
+@item lid
+queries the ID database for tokens, then reports matching file names or
+matching lines.
+
+@item fid
+lists all tokens recorded in the database for given files, or
+tokens common to two files.
+
+@item fnid
+matches the file names in the database, rather than the tokens.
+
+@item xtokid
+extracts raw tokens---helps with testing of new @file{mkid} scanners.
+
+@end table
+
+In addition, the ID utilities have historically provided several query
+programs which are specializations of @file{lid}:
+
+@table @file
+
+@item gid
+(alias for @samp{lid -R grep})
+lists all lines containing the requested pattern.
+
+@item eid
+(alias for @samp{lid -R edit})
+invokes an editor on all files containing the requested pattern, and
+if possible, initiates a text search for that pattern.
+
+@item aid
+(alias for @samp{lid -ils}) treats the requested pattern
+as a case-insensitive literal substring.
+
+@end table
+
+@cindex bugs, reporting
+Please report bugs to @samp{bug-gnu-utils@@gnu.ai.mit.edu}. Remember to
+include the version number, machine architecture, input files, and any
+other information needed to reproduce the bug: your input, what you
+expected, what you got, and why it is wrong. Diffs are welcome, but
+please include a description of the problem as well, since this is
+sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
+
+@c ************* gkm *********************************************************
+@node Quick start
+@chapter Quick Start Procedure
+
+@itemize @bullet
+@item
+Unpack the distribution.
+@item
+Type @samp{./configure}
+@item
+Type @samp{make}
+@item
+Type @samp{make install} as a user with the appropriate privileges
+(e.g., @samp{bin} or perhaps even @samp{root}).
+@item
+Type @samp{cd /usr/include; mkid} to build an ID database covering
+all of the system header files.
+@item
+Type @samp{lid FILE}, then @samp{gid strtok}, then @samp{aid stdout}.
+
+@end itemize
+
+You have just built, installed and used the most common commands of the
+GNU ID utilities. If you ever need help remembering which system header
+files contain a particular declaration, or reference a particular symbol,
+you'll want to keep the ID file you built in @file{/usr/include} for
+later use. If your working directory is elsewhere at the time, simply
+provide the @samp{-f /usr/include/ID} option to @file{lid} (@pxref{Reading
+options}).
+
+@c ************* gkm *********************************************************
+@node Common options
+@chapter Common command-line options
+
+@cindex common command-line options
+
+Certain options, and regular expression syntax, are shared by various
+groupings of the ID utilities. We describe these in the sections below,
+rather than repeating them for each program.
+
+@menu
+* Universal options:: Options common to all programs.
+* Extraction options:: Options for programs that extract tokens from source files.
+* Walker options:: Options for programs that walk file and directory trees.
+* Reading options:: Options for programs that read ID databases.
+* Writing options:: Options for programs that write ID databases.
+* File listing options:: Options for programs that list file names.
+@end menu
+
+@c ************* gkm *********************************************************
+@node Universal options
+@section Options Common to All Programs
+
+@table @samp
+
+@item --help
+@opindex --help
+@cindex help, online
+Print a usage message listing all available options, then exit successfully.
+
+@item --version
+@opindex --version
+@cindex version number, finding
+Print the version number, then exit successfully.
+
+@end table
+
+@c ************* gkm *********************************************************
+@node Reading options
+@section Options for Programs that Read ID Databases
+
+@table @samp
+
+@item -f @var{filename}
+@itemx --file=@var{filename}
+@opindex -f
+@opindex --file
+@cindex ID database file name
+
+@var{Filename} is the ID database to read when processing queries. At
+present, only a single @samp{--file} option is processed, but in future
+releases, more than one ID database may be named on the command line.
+
+@item $IDPATH
+@cindex ID database file name
+
+@samp{IDPATH} is an environment variable that contains a
+colon-separated list of ID database names. If this variable is present,
+and no @samp{--file} options are presented on the command line, the ID
+databases named in @samp{IDPATH} are implied.@footnote{At present, this
+feature isn't fully implemented, since only the first of a list of ID
+database names is processed.}
+
+@end table
+
+If no ID databases are specified either on the command line or via the
+@samp{IDPATH} environment variable, then the ID utilities search for a
+file named @file{ID} in the current working directory, and then in
+successive parent directories.
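+
+For instance, the lookup order described above might be exercised as
+follows. The paths and the token @samp{strtok} are only illustrative,
+and assume an ID database was built in @file{/usr/include} as in the
+quick-start procedure (@pxref{Quick start}):
+
+@example
+# Name the database explicitly.
+$ lid --file=/usr/include/ID strtok
+
+# Name it through the environment instead.
+$ IDPATH=/usr/include/ID lid strtok
+
+# Name nothing: lid looks for a file named ID in the current
+# directory, then in each parent directory in turn.
+$ cd /usr/include/sys
+$ lid strtok
+@end example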
+
+@c ************* gkm *********************************************************
+@node Writing options
+@section Options for Programs that Write ID Databases
+
+@table @samp
+
+@item -o @var{filename}
+@itemx --output=@var{filename}
+@opindex -o
+@opindex --output
+@cindex ID database file name
+
+The @samp{--output} option names the file in which to write a new ID
+database. If no @samp{--output} (or @samp{--file}) option is present,
+an output file named @file{ID} is implied.
+
+@item -f @var{filename}
+@itemx --file=@var{filename}
+@opindex -f
+@opindex --file
+@cindex ID database file name
+
+This is a synonym for @samp{--output}.
+
+@end table
+
+@c ************* gkm *********************************************************
+@node Walker options
+@section Options for Programs that Walk File and Directory Trees
+
+The programs @file{mkid} and @file{xtokid} accept the names of files and
+directories on the command line. Files are scanned if there is a
+scanner available and enabled for the file's source language.
+Directories are recursively descended, searching for files whose names
+match the rules listed in the @emph{language map} file (@pxref{Language
+map}).
+
+The following option controls the file tree walker:
+
+@table @samp
+
+@item -p @var{names}
+@itemx --prune=@var{names}
+@opindex -p
+@opindex --prune
+@cindex file tree pruning
+
+One or more file or directory names may appear in @var{names}. The file
+tree walker will stop short at these files and directories and their
+contents will not be scanned.
+
+@end table
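+
+As a sketch, suppose a source tree lives in a hypothetical directory
+@file{src} and its @file{obj} subdirectory holds only build output.
+The following invocation would index the tree while stopping short of
+that subdirectory. How a name is matched against the tree is not
+spelled out above, so the bare directory name used here is an
+assumption:
+
+@example
+$ mkid --prune=obj src
+@end example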
+
+@c ************* gkm *********************************************************
+@node File listing options
+@section Options for Programs that List File Names
+
+The programs @file{lid} and @file{fnid} can print lists of file names as
+the result of queries. The following option controls how these lists
+are formatted:
+
+@table @samp
+
+@item -S @var{style}
+@itemx --separator=@var{style}
+@opindex -S
+@opindex --separator
+@cindex file name separator
+
+@var{Style} may be one of @samp{braces}, @samp{space} or @samp{newline}.
+
+The @var{style} of @samp{braces} means that file names with common
+directory prefix and common suffix are printed using the shell's brace
+notation in order to compress the output. For example,
+@file{../src/foo.c ../src/bar.c} can be printed in brace notation as
+@file{../src/@{foo,bar@}.c}.
+
+The @var{style}s of @samp{space} and @samp{newline} mean that file names
+are separated by spaces or by newlines, respectively.
+
+If the list of files is being printed on a terminal, brace notation is
+the default. If not, file names are separated by spaces if the
+@var{key} is included in the output, and by newlines if the @var{key}
+style is @samp{none} (@pxref{lid invocation}).
+
+@end table
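+
+As an illustration, the three styles could be requested as shown
+below. The token @samp{strtok} is only a placeholder, and no output is
+shown because it depends on the contents of your database:
+
+@example
+# Compress names that share a directory prefix and suffix.
+$ lid --separator=braces strtok
+
+# Separate names with spaces.
+$ lid --separator=space strtok
+
+# Print one name per line, without the leading key.
+$ lid --key=none --separator=newline strtok
+@end example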
+
+@c ************* gkm *********************************************************
+@node Extraction options
+@section Options for Programs that Scan Source Files
+
+@file{mkid} and @file{xtokid} walk file trees, select source files by
+name, and extract tokens from source files. They accept the following
+options:
+
+@table @samp
+
+@item -m @var{mapfile}
+@itemx --lang-map=@var{mapfile}
+@opindex -m
+@opindex --lang-map
+@cindex language map file
+
+@var{Mapfile} contains rules for determining the source languages from
+file names (@pxref{Language map}).
+
+@item -i @var{languages}
+@itemx --include=@var{languages}
+@opindex -i
+@opindex --include
+@cindex include languages
+
+The @samp{--include} option names @var{languages} whose source files
+should be scanned and incorporated into the ID database. By default,
+all languages known to the ID utilities are enabled.
+
+@item -x @var{languages}
+@itemx --exclude=@var{languages}
+@opindex -x
+@opindex --exclude
+@cindex exclude languages
+
+The @samp{--exclude} option names @var{languages} whose source files
+should @emph{not} be scanned. The default list of excluded languages is
+empty. Note that only one of @samp{--include} or @samp{--exclude} may
+be specified on the command line for a single run.
+
+@item -l @var{language}:@var{options}
+@itemx --lang-option=@var{language}:@var{options}
+@opindex -l
+@opindex --lang-option
+@cindex language-specific option
+
+Language-specific scanners also accept options. @var{Language} denotes
+the desired scanner, and @var{options} are the command-line options that
+should be passed through to it. For example, to pass the @samp{-x
+--coke-bottle} options to the scanner for the language @samp{swizzle},
+pass this: @samp{-l swizzle:"-x --coke-bottle"}, or this:
+@samp{--lang-option=swizzle:"-x --coke-bottle"}, or this: @samp{-l
+swizzle:-x -l swizzle:--coke-bottle}. Use the @samp{--help} option to
+see the command-line option summary for each language-specific scanner.
+
+@end table
+
+@cindex scanners
+
+To determine which tokens to extract from a file and store in the
+database, @file{mkid} calls a @dfn{scanner}; we say a scanner
+@dfn{recognizes} a particular language. Scanners for several languages
+are built into @file{mkid}; you can add your own scanners as well, as
+explained in @ref{Defining scanners}.
+
+The ID utilities determine which scanner to use for a particular file by
+consulting the language-map file. Scanners for several languages are
+already built into the ID utilities. You can see which languages have built-in
+scanners, and examine their language-specific options by invoking
+@samp{mkid --help} or @samp{xtokid --help}.
+
+@menu
+* Language map:: Mapping file names to source languages.
+* C/C++ scanner:: For the C and C++ programming languages.
+* Assembler scanner:: For assembly language.
+* Text scanner:: For documents or other non-source code.
+* Perl scanner:: For Perl language (experimental).
+* Defining scanners:: Defining new scanners in the source code.
+@end menu
+
+@c ************* gkm *********************************************************
+@node Language map
+@subsection Mapping file names to source languages
+
+The file @file{id-lang.map}, installed by default in
+@file{$(prefix)/share/id-lang.map}, contains rules for mapping file
+names to source languages. Each rule comprises three parts: a shell
+@var{glob} pattern, a language name, and language-specific scanner
+options.
+
+The special pattern @samp{**} denotes the default source language. This is
+the language that's assigned to file names that don't match any other
+pattern.
+
+The special pattern @samp{***} should be followed by a file name. The
+named file should contain more language-map rules and is included at
+this point.
+
+The order in which rules are presented in a language-map file is
+significant. This order influences the order in which files are
+displayed as the result of queries. For example, the distributed
+language-map file places all rules for C @file{.h} files ahead of
+@file{.c} files, so that in general, declarations will precede
+definitions in query output. The same thing is done for C++ and its
+many different source file name extensions.
+
+Here is a pared-down version of the @file{id-lang.map} file distributed
+with the ID utilities:
+
+@example
+
+# Default language
+** IGNORE # Although this is listed first,
+ # the default language pattern is
+ # logically matched last.
+
+# Backup files
+*~ IGNORE
+*.bak IGNORE
+*.bk[0-9] IGNORE
+
+# SCCS files
+[sp].* IGNORE
+
+# list header files before code files
+*.h C
+*.h.in C
+*.H C++
+*.hh C++
+*.hpp C++
+*.hxx C++
+
+# list C `meta' files next
+*.l C
+*.lex C
+*.y C
+*.yacc C
+
+# list C code files after header files
+*.c C
+*.C C++
+*.cc C++
+*.cpp C++
+*.cxx C++
+
+# list assembly language after C
+*.[sS] asm --comment=;
+*.asm asm --comment=;
+
+# [nt]roff
+*.[0-9] roff
+*.ms roff
+*.me roff
+*.mm roff
+
+# TeX and friends
+*.tex TeX
+*.ltx TeX
+*.texi texinfo
+*.texinfo texinfo
+
+@end example
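+
+A site-specific map might build on the distributed one. The following
+is only a sketch: the file name @file{my-lang.map}, the installation
+prefix @file{/usr/local}, and the @file{*.inc} rule are assumptions
+chosen for illustration.
+
+@example
+# my-lang.map
+# Pull in the distributed rules.
+*** /usr/local/share/id-lang.map
+
+# Treat local include fragments as assembly with ';' comments.
+*.inc asm --comment=;
+@end example
+
+Such a file would then be selected with
+@samp{mkid --lang-map=my-lang.map}.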
+
+@c ************* gkm *********************************************************
+@node C/C++ scanner
+@subsection C/C++ Language Scanner
+
+@cindex C scanner, predefined
+
+The C scanner is the most commonly used. Files that match the glob
+pattern @file{*.h}, @file{*.c}, as well as @file{yacc} files that match
+@file{*.y} or @file{*.yacc}, and @file{lex} files that match @file{*.l}
+or @file{*.lex}, are processed with this scanner.
+
+Scanner-specific options (note that these options are presented
+@emph{without} the required @samp{-l} or @samp{--lang-option=} prefix):
+
+@table @samp
+
+@item -k @var{character-class}
+@itemx --keep=@var{character-class}
+@opindex -k
+@opindex --keep
+@opindex -l C:-k
+@opindex -l C:--keep
+@opindex --lang-option=C:-k
+@opindex --lang-option=C:--keep
+
+Consider the characters in @var{character-class} as valid constituents of
+identifier names. For example, if you are indexing C code that contains
+@samp{$} in some of its identifiers, you can include these by using
+@samp{--lang-option=C:--keep=$}, or @samp{-l C:"-k $"} (if you don't like
+to type so much).
+
+@item -i @var{character-class}
+@itemx --ignore=@var{character-class}
+@opindex -i
+@opindex --ignore
+@opindex -l C:-i
+@opindex -l C:--ignore
+@opindex --lang-option=C:-i
+@opindex --lang-option=C:--ignore
+
+Consider the characters in @var{character-class} as valid constituents of
+identifier names, but discard all tokens containing these characters.
+For example, if some C code has identifiers containing @samp{$}, but you
+don't want these cluttering up your ID database, use
+@samp{--lang-option=C:--ignore=$}, or the terser equivalent @samp{-l
+C:"-i $"}.
+
+@item -u
+@itemx --strip-underscore
+@opindex -u
+@opindex --strip-underscore
+@opindex -l C:-u
+@opindex -l C:--strip-underscore
+@opindex --lang-option=C:-u
+@opindex --lang-option=C:--strip-underscore
+
+Strip one leading underscore from C identifiers encapsulated as
+character strings. This option is useful if you are indexing C code
+that contains symbol-table name strings for systems that prepend an
+underscore to external symbols. By default, the leading underscore is
+retained.
+
+@end table
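+
+Because @file{xtokid} accepts the same extraction options as
+@file{mkid}, it offers a way to preview the effect of a scanner option
+before rebuilding a database. In this sketch, @file{vms.c} is a
+hypothetical file whose identifiers contain @samp{$}:
+
+@example
+# Keep '$' as an identifier constituent.
+$ xtokid '--lang-option=C:--keep=$' vms.c
+
+# Recognize such identifiers, but discard them.
+$ xtokid '--lang-option=C:--ignore=$' vms.c
+@end example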
+
+@c ************* gkm *********************************************************
+@node Assembler scanner
+@subsection Assembly Language Scanner
+
+@cindex assembler scanner
+@cindex assembly language scanner
+
+Assembly languages use a variety of commenting conventions, and allow a
+variety of special characters to @emph{dirty up} local symbols,
+preventing name space conflicts with symbols defined by higher-level
+languages. Also, some compilation systems prepend an underscore to
+external symbols. The options listed below are designed to address
+these differences.
+
+@table @samp
+
+@item -c @var{character-class}
+@itemx --comment=@var{character-class}
+@opindex -c
+@opindex --comment
+@opindex -l asm:-c
+@opindex -l asm:--comment
+@opindex --lang-option=asm:-c
+@opindex --lang-option=asm:--comment
+
+The characters in @var{character-class} are considered left delimiters
+for comments that extend until the end of the current line.
+
+@item -k @var{character-class}
+@itemx --keep=@var{character-class}
+@opindex -k
+@opindex --keep
+@opindex -l asm:-k
+@opindex -l asm:--keep
+@opindex --lang-option=asm:-k
+@opindex --lang-option=asm:--keep
+
+Consider the characters of @var{character-class} as valid constituents of
+identifier names. For example, if you are indexing assembly code that
+prepends @samp{.} to assembler directives, and prepends @samp{%} to
+register names, you can keep these characters in the tokens by specifying
+@samp{--lang-option=asm:--keep=.%}, or @samp{-l asm:"-k .%"}.
+
+@item -i @var{character-class}
+@itemx --ignore=@var{character-class}
+@opindex -i
+@opindex --ignore
+@opindex -l asm:-i
+@opindex -l asm:--ignore
+@opindex --lang-option=asm:-i
+@opindex --lang-option=asm:--ignore
+
+Consider the characters of @var{character-class} as valid constituents
+of identifier names, but discard all tokens containing these characters.
+For example, if you don't want to clutter your ID database with
+assembler directives that begin with a leading @samp{.} or with
+assembler labels that contain @samp{@@}, use
+@samp{--lang-option=asm:--ignore=.@@}, or @samp{-l asm:"-i .@@"}.
+
+@item -u
+@itemx --strip-underscore
+@opindex -u
+@opindex --strip-underscore
+@opindex -l asm:-u
+@opindex -l asm:--strip-underscore
+@opindex --lang-option=asm:-u
+@opindex --lang-option=asm:--strip-underscore
+
+Strip one leading underscore from identifiers. This option is useful if
+your compilation system prepends an underscore to external symbols. By
+stripping the underscore, you can canonicalize such names and bring them
+into conformance with the way they are expressed in the C language. By
+default, the leading underscore is retained.
+
+@item -n
+@itemx --no-cpp
+@opindex -n
+@opindex --no-cpp
+@opindex -l asm:-n
+@opindex -l asm:--no-cpp
+@opindex --lang-option=asm:-n
+@opindex --lang-option=asm:--no-cpp
+
+Do not recognize C preprocessor directives. By default, such lines are
+handled in the same way as they are by the C language scanner.
+
+@end table
+
+@c ************* gkm *********************************************************
+@node Text scanner
+@subsection Text Scanner
+
+@cindex text scanner
+
+The plain text scanner is intended for human-language documents, or as the
+scanner of last resort for files that have no scanner that is more
+specific. It is customizable to the extent that character classes can
+be designated as token constituents or as token delimiters. The default
+token constituents are the alpha-numerics; all other characters are
+considered token delimiters.
+
+@table @samp
+
+@item -i @var{character-class}
+@itemx --include=@var{character-class}
+@opindex -i
+@opindex --include
+@opindex -l text:-i
+@opindex -l text:--include
+@opindex --lang-option=text:-i
+@opindex --lang-option=text:--include
+
+Include characters belonging to @var{character-class} in tokens.
+
+@item -x @var{character-class}
+@itemx --exclude=@var{character-class}
+@opindex -x
+@opindex --exclude
+@opindex -l text:-x
+@opindex -l text:--exclude
+@opindex --lang-option=text:-x
+@opindex --lang-option=text:--exclude
+
+Exclude characters belonging to @var{character-class} from tokens, i.e., treat
+them as token delimiters.
+
+@end table
+
+@c ************* gkm *********************************************************
+@node Perl scanner
+@subsection Perl Scanner
+
+@cindex perl scanner
+(EXPERIMENTAL)
+
+The Perl scanner is intended for Perl-language documents. Tokens are all
+words, including Perl keywords. Comments and string declarations are
+ignored, as is the documentation. It is customizable to the extent
+that character classes can be designated as token constituents or as
+token delimiters. The default token constituents are the alpha-numerics;
+all other characters are considered token delimiters.
+
+@table @samp
+
+@item -i @var{character-class}
+@itemx --include=@var{character-class}
+@opindex -i
+@opindex --include
+@opindex -l perl:-i
+@opindex -l perl:--include
+@opindex --lang-option=perl:-i
+@opindex --lang-option=perl:--include
+
+Include characters belonging to @var{character-class} in tokens.
+
+@item -x @var{character-class}
+@itemx --exclude=@var{character-class}
+@opindex -x
+@opindex --exclude
+@opindex -l perl:-x
+@opindex -l perl:--exclude
+@opindex --lang-option=perl:-x
+@opindex --lang-option=perl:--exclude
+
+Exclude characters belonging to @var{character-class} from tokens, i.e., treat
+them as token delimiters.
+
+@item -d
+@itemx --dtags
+@opindex -d
+@opindex --dtags
+@opindex -l perl:-d
+@opindex -l perl:--dtags
+@opindex --lang-option=perl:-d
+@opindex --lang-option=perl:--dtags
+
+Include tokens from the documentation. By default, the tokens in the
+documentation are ignored.
+
+@end table
+
+@c ************* gkm *********************************************************
+@node Defining scanners
+@subsection Defining New Scanners in the Source Code
+
+@flindex scanners.c
+@cindex scanners, defining in source code
+
+@vindex languages_0
+
+To add a new scanner in source code, you should add a new section to the
+file @file{scanners.c}. It might be easiest to clone one of the
+existing scanners and modify it as necessary. For the hypothetical
+language @var{foo}, you must define the functions @code{get_token_foo},
+@code{parse_args_foo}, @code{help_me_foo}, as well as the tables
+@code{long_options_foo} and @code{args_foo}. If your scanner is
+modeled after one of the existing scanners, you'll also need a
+character-attribute table @code{ctype_foo}.
+
+This is not a terribly difficult programming task, but it requires
+recompiling and installing the new version of @file{mkid} and @file{xtokid}.
+You should use @file{xtokid} to test the operation of the new scanner.
+
+Once these functions and tables are ready, add function prototypes and
+an entry to the @code{languages_0} table near the beginning of the file.
+
+Be warned that the existing scanners are built for speed, not elegance
+or readability. You might wish to create a new scanner that's easier to
+read and understand if you don't feel that speed is so important.
+
+@c ************* gkm *********************************************************
+@node mkid invocation
+@chapter @samp{mkid}: Creating an ID Database
+@cindex creating databases
+@cindex databases, creating
+@cindex ID file format
+@cindex architecture-independence
+@cindex sharing ID files
+
+@file{mkid} builds an ID database. It accepts the names of files and/or
+directories on the command line, selects files that have an enabled
+scanner, then extracts and stores tokens from those files. The
+resulting ID database is architecture- and byte-order-independent so it
+can be shared among all systems.
+
+The primary virtues of @file{mkid} are speed and high capacity. The
+size of the source trees it can index is limited only by available
+system memory. @file{mkid}'s indexing algorithm is very space-efficient
+and exhibits excellent locality-of-reference, and so is capable of
+operating with a working-set size that is only half the size of its
+virtual address space. A typical @sc{unix}-like operating system with
+16 megabytes of system memory should be able to build an ID database
+covering approximately 12,000--14,000 source files totaling
+approximately 50--100 megabytes. A 66 MHz 486 computer can build such
+a large ID database in approximately 10--15 minutes.
+
+@pindex cron
+In a future release, @file{mkid} will be able to incrementally update an
+ID database much faster than it can build one from scratch. Until this
+feature becomes available, it might be a good idea to schedule a
+@file{cron} job to regularly update large ID databases during off-hours.
+
+@file{mkid} writes the ID file, therefore it accepts the @samp{--output}
+(and @samp{--file}) options as described in @ref{Writing options}.
+@file{mkid} extracts tokens from source files, therefore it accepts the
+@samp{--lang-map}, @samp{--include}, @samp{--exclude}, and
+@samp{--lang-option} options, as well as the language-specific scanner
+options, all of which are described in @ref{Extraction options}.
+@file{mkid} walks file trees, therefore it handles file and directory
+names on its command line and the @samp{--prune} option as described in
+@ref{Walker options}.
+
+In addition, @file{mkid} accepts the following command-line options:
+
+@table @samp
+
+@item -s
+@itemx --statistics
+@opindex -s
+@opindex --statistics
+@cindex statistics
+
+@file{mkid} reports statistics about resource usage at the end of its
+run.
+
+@item -v
+@itemx --verbose
+@opindex -v
+@opindex --verbose
+@cindex @file{mkid} progress
+
+@file{mkid} reports statistics about each file as it is scanned, and
+about the resource usage of its indexing algorithm at regular intervals.
+
+@end table
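+
+Putting these options together, a complete invocation might look like
+the following sketch. The directory names and the output path are
+hypothetical:
+
+@example
+# Index two trees into a named database and report resource usage.
+$ mkid --output=/var/tmp/proj.ID --statistics src include
+@end example
+
+The off-hours rebuild suggested above could be scheduled with a
+@file{crontab} entry along these lines (the schedule is arbitrary):
+
+@example
+# Rebuild the system header database at 03:00 every day.
+0 3 * * * cd /usr/include && mkid
+@end example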
+
+@c ************* gkm *********************************************************
+@node lid invocation
+@chapter @code{lid}: Querying an ID Database by Token
+
+The @file{lid} program accepts @var{patterns} on the command line which
+it matches against the tokens stored in an ID database. The
+interpretation of a @var{pattern} is determined by the makeup of the
+@var{pattern} string itself, or can be overridden by command-line
+options. If a @var{pattern} contains regular expression meta-characters,
+it is used to perform a regular-expression substring search. If no such
+meta-characters are present, @var{pattern} is used to perform a literal
+word search. (By default, all searches are sensitive to alphabetic
+case.) If no @var{pattern} is supplied on the command line, @file{lid}
+lists every entry in the ID database.
+
+@file{lid} reads the ID database, therefore it accepts the @samp{--file}
+option, and consults the @samp{IDPATH} environment variable, as
+described in @ref{Reading options}. @file{lid} lists file names,
+therefore it accepts the @samp{--separator} option, as described in
+@ref{File listing options}.
+
+In addition, @code{lid} accepts the following command-line options:
+
+@table @samp
+
+@item -i
+@itemx --ignore-case
+@opindex -i
+@opindex --ignore-case
+@cindex alphabetic case, ignoring differences in
+@cindex ignoring differences in alphabetic case
+
+Ignore differences in alphabetic case between the @var{pattern} and
+the tokens in the ID database.
+
+@item -l
+@itemx --literal
+@opindex -l
+@opindex --literal
+
+Match @var{pattern} as a literal string. Use this option if
+@var{pattern} contains regular-expression meta-characters, but you don't
+wish to perform a regular-expression search.
+
+@item -r
+@itemx --regexp
+@opindex -r
+@opindex --regexp
+
+Match @var{pattern} as an @emph{extended} regular expression@footnote{Extended
+regular expressions are the same as those accepted by @file{egrep}.}.
+Use this option if no regular-expression meta-characters are
+present in @var{pattern}, but you wish to force a regular-expression
+search (note: in this case, a @emph{literal substring} search might be
+faster).
+
+@item -w
+@itemx --word
+@opindex -w
+@opindex --word
+
+Match @var{pattern} using a word-delimited (non substring) search. This
+is the default for literal searches.
+
+@item -s
+@itemx --substring
+@opindex -s
+@opindex --substring
+
+Match @var{pattern} using a substring (non word-delimited) search. This
+is the default for regular expression searches.
+
+@item -k @var{style}
+@itemx --key=@var{style}
+@opindex -k
+@opindex --key
+
+@var{Style} can be one of @samp{token}, @samp{pattern} or @samp{none}.
+This option controls how the subject of the query is presented. This is
+best illustrated by example:
+
+@example
+$ lid --key=token '^dest.'
+destaddr libsys/memcpy.c
+destination libsys/regex.c
+destlst libsys/rx.c
+destpos libsys/rx.c
+destset libsys/rx.h libsys/rx.c
+
+$ lid --key=pattern '^dest.'
+^dest. libsys/rx.h libsys/@{memcpy,regex,rx@}.c
+
+$ lid --key=none '^dest.'
+libsys/rx.h libsys/@{memcpy,regex,rx@}.c
+@end example
+
+When @samp{--key} is either @samp{token} or @samp{pattern}, the first
+column of output is a @var{token} or @var{pattern}, respectively. When
+@samp{--key} is @samp{none}, neither of these is printed, and the file
+name list begins immediately. The default is @samp{token}.
+
+@item -R @var{style}
+@itemx --result=@var{style}
+@opindex -R
+@opindex --result
+
+@var{Style} can be one of @samp{filenames}, @samp{grep}, @samp{edit} or
+@samp{none}. This option controls how the value associated with the
+query's @var{key} is presented. When @var{style} is @samp{filenames}, a
+list of file names is printed (this is the default). When @var{style}
+is @samp{grep}, the lines that match @var{pattern} are printed in the
+same format as @samp{egrep -n}. When @var{style} is @samp{edit}, the
+file names are passed to an editor, and if possible @var{pattern} is
+passed as an initial search string (@pxref{eid invocation}). When
+@var{style} is @samp{none}, the file names are not processed in any way.
+This can be useful if you wish to see what tokens match a @var{pattern},
+but don't care about where they reside.
+
+@item -d
+@itemx -o
+@itemx -x
+@opindex -d
+@opindex -o
+@opindex -x
+@cindex radix of numeric matches, specifying
+@cindex numeric matches, specifying radix of
+
+These options may be used in any combination to specify the radix of
+numeric matches. @samp{-d} allows matching on decimal numbers,
+@samp{-o} on octal numbers, and @samp{-x} on hexadecimal numbers. The
+default is to match all three radixes.
+
+@item -F @var{range}
+@itemx --frequency=@var{range}
+@opindex -F
+@opindex --frequency
+@cindex single matches, showing
+
+Match tokens whose occurrence count falls in @var{range}. @var{Range}
+may be expressed as a single number @var{n}, or as a range
+@var{n@code{..}m}. Either limit of the range may be omitted (e.g.,
+@var{@code{..}m}, or @var{n@code{..}}). If the lower limit @var{n} is
+omitted, it defaults to @code{1}. If the upper limit is omitted, it
+defaults in the present implementation to @code{65535}, the maximum
+value of an unsigned 16-bit integer.
+
+A particularly useful query is @samp{lid -F1}, which helps locate
+identifiers that are defined but never used, or are used but never
+defined. Similarly, @samp{lid -F2} can help find functions that possess
+a prototype declaration and a definition, but are never called.
+
+@item -a @var{number}
+@itemx --ambiguous=@var{number}
+@opindex -a
+@opindex --ambiguous
+@cindex ambiguous identifier names, finding
+
+List identifiers (not numbers) that are ambiguous for the first
+@var{number} characters. This feature might be useful when porting
+programs to ancient pea-brained compilers that don't support long
+identifier names. However, the best long-term option is to set such
+systems on fire.
+
+@end table
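+
+As a rough sketch of how these options combine, the queries below use
+only options described in the table above. The token @samp{strlen} and
+the prefix length @samp{8} are arbitrary placeholders, and no output is
+shown because it depends entirely on the contents of your ID database.
+
+@example
+$ lid -il strlen          # case-insensitive literal search
+$ lid -R grep '^dest.'    # print matching lines, as egrep -n would
+$ lid -R none '^dest.'    # print matching tokens only, no file names
+$ lid -F ..2              # tokens that occur at most twice
+$ lid -a 8                # identifiers ambiguous in their first 8 characters
+@end example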
+
+@menu
+* lid aliases:: Aliases for specialized lid queries
+* Emacs gid interface:: GNU Emacs query interface
+* eid invocation:: Invoking an editor on query results
+@end menu
+
+@c ************* gkm *********************************************************
+@node lid aliases
+@section Aliases for Specialized @file{lid} Queries
+
+Historically, the ID utilities have provided several query interfaces
+which are specializations of @code{lid} (@pxref{lid invocation}).
+
+@table @file
+
+@item gid
+(alias for @samp{lid -R grep})
+lists all lines containing the requested pattern.
+
+@item eid
+(alias for @samp{lid -R edit})
+invokes an editor on all files containing the requested pattern, and
+optionally initiates a text search for that pattern.
+
+@item aid
+(alias for @samp{lid -ils}) treats the requested pattern
+as a case-insensitive literal substring.
+
+@end table
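+
+Because these are plain aliases, the two commands in each pair below
+should be interchangeable. This is only a sketch: the pattern
+@samp{main} is an arbitrary placeholder, and no output is shown because
+it depends on the contents of your ID database.
+
+@example
+$ gid main                # same as: lid -R grep main
+$ eid main                # same as: lid -R edit main
+$ aid main                # same as: lid -ils main
+@end example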
+
+@c ***************************************************************************
+@node Emacs gid interface
+@section GNU Emacs query interface
+
+@cindex Emacs interface to @code{gid}
+@flindex idutils.el @r{interface to Emacs}
+
+@vindex load-path
+The @code{idutils} source distribution comes with a file @file{idutils.el},
+which defines a GNU Emacs interface to @code{gid}. To install it, put
+@file{idutils.el} somewhere that Emacs will find it (i.e., in your
+@code{load-path}) and put
+
+@example
+(autoload 'gid "idutils" nil t)
+@end example
+
+@noindent in one of Emacs' initialization files, e.g., @file{~/.emacs}.
+You will then be able to use @kbd{M-x gid} to run the command.
+
+@findex gid @r{Emacs function}
+The @code{gid} function prompts you with the word around point. If you
+want to search for something else, simply delete the line and type the
+pattern of interest.
+
+@flindex *compilation* @r{Emacs buffer}
+The function then runs the @code{gid} program in a @samp{*compilation*}
+buffer, so the normal @code{next-error} function can be used to visit
+all the places the identifier is found (@pxref{Compilation,,, emacs, The
+GNU Emacs Manual}).
+
+@c ************* gkm *********************************************************
+@node eid invocation
+@section @code{eid}: Invoking an Editor on Query Results
+
+@pindex eid
+
+@samp{lid -R edit} is an editing interface for the ID utilities that is
+most commonly used with @file{vi}. Emacs users should use the interface
+defined in @code{idutils.el} (@pxref{Emacs gid interface}). The ID
+utilities include an alias called @file{eid}, and for the sake of
+brevity, we'll use this alias for the remainder of this section.
+@file{eid} performs a @file{lid}-style query, then asks if you wish to edit
+the files. If your query yields more than one line of output, you will
+be prompted after each line. This is the prompt you'll see:
+
+@example
+Edit? [y1-9^S/nq]
+@end example
+
+@noindent
+You may respond with:
+
+@table @samp
+
+@item y
+Edit all files listed.
+
+@item 1@dots{}9
+Edit all files starting at the @math{@var{n} + 1}'st file.
+
+@item /@var{regexp} @r{or} @kbd{CTRL-S}@var{regexp}
+Search into the file list, and begin editing with the first file name
+that matches the regular expression @var{regexp}.
+
+@item n
+Don't edit any files. If another line of query output is pending,
+advance to that line, for which another @samp{Edit?} prompt will appear.
+
+@item q
+Quit---don't edit any files, and don't process any more lines of query
+output.
+
+@end table
+
+Here is an example:
+
+@example
+prompt$ eid FILE \^print
+FILE @{ansi2knr,fid,filenames,idfile,idx,lid,misc,@dots{}@}.c
+Edit? [y1-9^S/nq] n
+^print @{ansi2knr,fid,getopt,getopt1,lid,mkid,regex,scanners@}.c
+Edit? [y1-9^S/nq] 2
+@end example
+
+@noindent This will start editing at @file{getopt.c}.
+
+@code{eid} invokes the editor defined by the environment variable
+@samp{VISUAL}. If @samp{VISUAL} is undefined, it uses the environment
+variable @samp{EDITOR} instead. If @samp{EDITOR} is undefined, it
+defaults to @file{vi}. It is possible for @file{eid} to pass the editor
+an initial search pattern so that your cursor will immediately alight on
+the token of interest. This feature is controlled by the following
+environment variables:
+
+@table @samp
+
+@item EIDARG
+@vindex EIDARG
+@cindex search for token, initial
+A printf(3) format string for the editor argument to search for the
+matching token. For @code{vi}, this should be @samp{+/%s/}.
+
+@item EIDLDEL
+@vindex EIDLDEL
+@cindex left delimiter editor argument
+@cindex beginning-of-word editor argument
+The regular-expression meta-character(s) for delimiting the beginning of
+a word (the `@file{eid} Left DELimiter'). @code{eid} inserts this in
+front of the matching token when a word-search is desired. For
+@file{vi}, this should be @samp{\<}.
+
+@item EIDRDEL
+@vindex EIDRDEL
+@cindex right delimiter editor argument
+@cindex end-of-word editor argument
+The regular-expression meta-character(s) for delimiting the end of
+a word (the `@file{eid} Right DELimiter'). @code{eid} inserts this at
+the end of the matching token when a word-search is desired. For
+@file{vi}, this should be @samp{\>}.
+
+@end table
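+
+For a Bourne-style shell, the @file{vi} values given above could be set
+as follows. This is a sketch rather than a required configuration; the
+single quotes keep the shell from interpreting the backslashes.
+
+@example
+$ EIDARG='+/%s/'
+$ EIDLDEL='\<'
+$ EIDRDEL='\>'
+$ export EIDARG EIDLDEL EIDRDEL
+@end example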
+
+@c ************* gkm *********************************************************
+@node fid invocation
+@chapter @code{fid}: Listing a file's tokens
+
+@pindex fid
+@cindex tokens in a file
+@cindex tokens common to two files
+
+@file{fid} prints the tokens found in a given file. If two file names
+are passed on the command line, @file{fid} prints the tokens that are
+common to both files (i.e., the @emph{set intersection} of the two token
+sets).
+
+@file{fid} reads the ID database, therefore it accepts the @samp{--file}
+option, and consults the @samp{IDPATH} environment variable, as
+described in @ref{Reading options}.
+
+If the standard output is attached to a terminal, the printed tokens are
+separated by spaces. Otherwise, the tokens are printed one per line.
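+
+The following sketch shows both forms, plus a pipeline that relies on
+the one-token-per-line output to count the tokens in a file. The file
+names @file{lid.c} and @file{mkid.c} are placeholders for files recorded
+in your own ID database, and no output is shown because it depends on
+that database.
+
+@example
+$ fid lid.c               # tokens found in lid.c
+$ fid lid.c mkid.c        # tokens common to both files
+$ fid lid.c | wc -l       # output is not a terminal: one token per line
+@end example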
+
+@c ************* gkm *********************************************************
+@node fnid invocation
+@chapter @code{fnid}: Looking up filenames
+
+@pindex fnid
+@cindex filenames, matching
+@cindex matching filenames
+
+@file{fnid} queries the list of file names stored in the ID database.
+It accepts shell @emph{wildcard} patterns on the command line. If no
+pattern is supplied, @file{*} is implied. @file{fnid} prints the
+file names that match the given patterns.
+
+@file{fnid} prints file names, and as such accepts the
+@samp{--separator} option as described in @ref{File listing options}.
+
+For example, the command:
+
+@example
+fnid \*.c
+@end example
+
+@noindent lists all the @file{.c} files in the database. (The @samp{\}
+here protects the @samp{*} from being expanded by the shell.)
+
+@c ************* gkm *********************************************************
+@node xtokid invocation
+@chapter @file{xtokid}: Testing Language Scanners
+
+@file{xtokid} accepts the names of files and/or directories on the
+command line, then extracts and prints a stream of tokens from those
+files for which it has a valid, enabled scanner. This is useful
+primarily for debugging new @file{mkid} scanners (@pxref{Defining
+scanners}).
+
+@file{xtokid} extracts tokens from source files, therefore it accepts
+the @samp{--lang-map}, @samp{--include}, @samp{--exclude}, and
+@samp{--lang-option} options, as well as the language-specific scanner
+options, all of which are described in @ref{Extraction options}.
+@file{xtokid} walks file trees, therefore it handles file and directory
+names on its command line and the @samp{--prune} option as described in
+@ref{Walker options}.
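+
+As a sketch, either of the following invocations would exercise the
+scanners. The names @file{scanners.c} and @file{src} are placeholders
+for a file and a directory of your own, and no output is shown because
+it depends on the input.
+
+@example
+$ xtokid scanners.c       # tokens from a single file
+$ xtokid src              # walk a directory tree
+@end example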
+
+The name @samp{xtokid} indicates that it is the ``eXtract TOKens ID
+utility''.
+
+@c ************* gkm *********************************************************
+@node Past and Future
+@chapter Past and Future
+
+@cindex history
+
+@pindex look @r{and @file{mkid} 1}
+@cindex McGary, Greg
+Greg McGary conceived of the ideas behind the ID utilities when he
+began working on the Unix kernel in 1984. He needed a navigation tool
+to help him find his way around the expansive, unfamiliar landscape.
+The first @code{idutils}-like tools were shell scripts, and produced an
+ASCII database that looks much like the output of @samp{lid ".*"}. It
+took over an hour on a @sc{vax 11/750} to build a database for a
+@sc{4.1bsd} derived kernel. The first version of @file{lid} used the
+@sc{unix} system utility @code{look}, modified to handle very long
+lines.
+
+In 1986, Greg rewrote the shell scripts in C to improve performance.
+Build times for the ID file were shortened by an order of magnitude.
+The ID utilities were first posted to @samp{comp.sources.unix} in
+September 1987 under the name @code{id}.
+
+@cindex Horsley, Tom
+@cindex Scofield, Doug
+@cindex Leonard, Bill
+@cindex Berry, Karl
+Over the next few years, several versions diverged from the original
+source. Tom Horsley at Harris Computer Systems Division stepped forward
+to take over maintenance and integrated some of the fixes from divergent
+versions. A first release of the renamed @file{mkid} @w{version 2} was
+posted to @file{alt.sources} near the end of 1990. At that time, Tom
+wrote a Texinfo manual with the encouragement of the net community.
+(Tom especially thanks Doug Scofield and Bill Leonard whom he dragooned
+into helping proofread and edit---they found several problems in the
+initial version.) Karl Berry revamped the manual for Texinfo style,
+indexing, and organization in 1995.
+
+In January 1995, Greg McGary reemerged as the primary maintainer and
+launched development of @file{mkid} version 3, whose primary new feature
+is an efficient algorithm for building databases that is linear in both
+time and space over the size of the input text. (The old algorithm was
+quadratic in space so it was incapable of handling very large source
+trees.) For the first time, the code was released under the GNU General
+Public License.
+
+In June 1996, the package was renamed again to @code{id-utils} and was
+released for the first time under FSF copyright as part of the GNU
+system. All programs had their command-line arguments completely
+revised. The @file{mkid} and @file{xtokid} programs also gained a
+file-tree walker, so that directory names can be passed on the command
+line instead of the names of every individual file. Greg reorganized
+and rewrote most of the Texinfo manual to reflect these changes.
+
+In 2006, the package name changed slightly from @code{id-utils} to
+@code{idutils}, to be more consistent with other GNU package names.
+
+@pindex cscope
+@pindex grep
+@cindex future
+Future releases of @code{idutils} might include:
+
+@itemize @bullet
+
+@item
+an optional coupling with GNU @code{grep}, so that @code{grep} can use
+an ID database for hints
+
+@item
+a @code{cscope} work-alike query interface
+
+@item
+incremental update of the ID database.
+
+@end itemize
+
+@c ***************************************************************************
+@node Index
+@unnumbered Index
+
+@printindex cp
+
+@contents
+@bye