F is used as the kind letter for the file kind in Exuberant-ctags; the
F was hard-coded in ctags internals. However, we found that some built-in
parsers, including Ruby, use F for their own purposes. So if you
find a tag with F as its kind letter, you cannot tell what it is:
a file name or something specific to the language. Long kind
description strings may help, but we are not sure that all tools
using tags files refer to the long kind description strings.

Universal-ctags does not allow parsers, whether built-in or optlib,
to use F for their own purposes.

Japanese programmers sometimes use the Japanese language in source
code comments. Of course, it is not limited to Japanese. People may
use their own native language and in such cases encoding becomes an
issue.

ctags doesn’t consider the input encoding; it just reads input as a
sequence of bytes and uses them as is when writing tags entries.

On the other hand Vim does consider input encoding. When loading a
file, Vim converts the file contents into an internal format with one
of the encodings specified in its fileencodings option.

As a result of this difference, Vim cannot always move the cursor to
the definition of a tag as users expect when attempting to match the
patterns in a tags file.

The good news is that there is a way to notify Vim of the encoding
used in a tags file with the TAG_FILE_ENCODING pseudo tag.

Two new options have been introduced (--input-encoding=IN and
--output-encoding=OUT).

Using the encoding specified with these options ctags converts input
from IN to OUT. ctags uses the converted strings when writing
the pattern parts of each tag line. As a result the tags output is
encoded in OUT encoding.

In addition OUT is specified at the top of the tags file as the
value for the TAG_FILE_ENCODING pseudo tag. The default value of
OUT is UTF-8.

NOTE: Converted input is NOT passed to language parsers.
The parsers still deal with input as a byte sequence.
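The conversion step can be pictured with a small sketch. It is an illustration in Python, not ctags code: the function name convert_pattern and the sample line are assumptions, and Python's codec names stand in for the names iconv accepts.

```python
def convert_pattern(raw: bytes, in_enc: str, out_enc: str = "UTF-8") -> bytes:
    """Re-encode the bytes of a pattern from in_enc (IN) to out_enc (OUT)."""
    # Decode with the input encoding, then encode with the output encoding.
    return raw.decode(in_enc).encode(out_enc)


# A source line with a Japanese comment, stored as Shift_JIS bytes.
line = "int x; /* 変数 */"
raw_sjis = line.encode("shift_jis")

# With no explicit OUT, the sketch falls back to UTF-8, like the default above.
converted = convert_pattern(raw_sjis, "shift_jis")
```

Only the text written to the tags file goes through such a step; as the note above says, parsers still see the original bytes.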

With --input-encoding-<LANG>=IN, you can specify a specific input
encoding for LANG. It overrides the global default value given
with --input-encoding.

The example usage can be found in Tmain/{input,output}-encoding-option.d.

Acceptable IN and OUT values can be listed with iconv -l or
iconv --list. They are platform dependent.

To enable the option, libiconv is needed on your platform. In addition
--enable-iconv must be given to configure before making ctags.
On Windows mingw32, you must specify WITH_ICONV=yes like this:

Exuberant-ctags provides a way to inspect its internals via
--list-kinds, --list-languages, and --list-maps.

This idea has been expanded in Universal-ctags with
--list-kinds-full, --list-map-extensions, --list-extras,
--list-features, --list-fields, --list-map-patterns, and
--list-pseudo-tags being added.

The original three --list- options are not changed, for
compatibility reasons; however, the newly introduced options are
recommended for all future use.

By default, interactive use is assumed and ctags tries aligning the
list output in columns for easier reading.

When --machinable is given before a --list- option, ctags
outputs the list in a format more suitable for processing by scripts.
Tab characters are used as separators between columns. The alignment
of columns is never considered when --machinable is given.

Currently only --list-extras, --list-fields and
--list-kinds-full support --machinable output.

These new --list- options also print a column header, a line
representing the name of each column. The header may help users and
scripts to understand and recognize the columns. Ignoring the column
header is easy because it starts with a # character.
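A script consuming such output might look like the following sketch. It assumes only what is stated above (tab-separated columns and a header line starting with #); the function name parse_machinable and the sample rows are hypothetical, and real column sets may differ.

```python
def parse_machinable(text: str):
    """Parse tab-separated list output whose header line starts with '#'."""
    header, rows = None, []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("#"):
            # The column header: strip the leading '#' and keep the names.
            header = line[1:].split("\t")
            continue
        rows.append(line.split("\t"))
    if header is None:
        return rows
    # Pair each row with the column names from the header.
    return [dict(zip(header, row)) for row in rows]


# Hypothetical sample shaped like a --machinable listing.
sample = "#LETTER\tNAME\tENABLED\nq\tqualified\tno\nr\treference\tno\n"
records = parse_machinable(sample)
```

A script that does not need the column names can simply skip every line starting with #.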

In Universal-ctags, as in Exuberant-ctags, most kinds are parser
local; enabling (or disabling) a kind in a parser has no effect on
kinds in any other parsers even those with the same name and/or
letter.

However, there are exceptions, such as C and C++. C++ can
be considered a language extended from C. Therefore it is natural
that all kinds defined in the C parser are also defined in the C++
parser. Enabling a kind in the C parser also enables a kind having
the same name in the C++ parser, and vice versa.

A kind group is a group of kinds satisfying the following conditions:

Having the same name and letter, and

Being synchronized with each other

A master parser manages the synchronization of a kind group. The
MASTER column of --list-kinds-full shows the master parser of
the kind.

Internally, a state change (enabled or disabled with
--kind-<LANG>=[+|-]...) of a kind in a kind group is reported to
its master parser as an event. Then the master parser updates the
state of all kinds in the kind group as specified with the option.
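The event flow can be sketched as follows. This is a toy model in Python, not the ctags implementation: the class KindGroup, its method names, and the choice of C as the master of a group shared with C++ are assumptions made for illustration.

```python
class KindGroup:
    """A kind shared by several parsers and synchronized by a master parser."""

    def __init__(self, master, members, enabled=True):
        self.master = master
        # One enabled/disabled state per parser in the group (master included).
        self.state = {lang: enabled for lang in [master] + list(members)}

    def set_enabled(self, lang, enabled):
        """Handle a state change requested for the kind in parser `lang`."""
        if lang not in self.state:
            raise KeyError(lang)
        # The change is reported to the master, which then updates the
        # state of the kind in every parser of the group.
        for member in self.state:
            self.state[member] = enabled


# Hypothetical group: one kind shared by C and C++, with C as the master.
group = KindGroup(master="C", members=["C++"])
group.set_enabled("C++", False)
```

Disabling the kind through either parser leaves both parsers with the kind disabled, which matches the C and C++ behaviour described above.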

Though language FOO is added before BAR, only BAR is set as a
handler for the spec *.ABC.

Universal-ctags enables multiple parsers to be configured for a spec.
The appropriate parser for a given input file can then be chosen by a
variety of internal guessing strategies (see “Choosing a proper
parser in ctags”).

Even if “yes” is specified as the option argument for --tag-relative,
absolute paths are used in tags output if an input is given as
an absolute path. This behavior is expected in Exuberant-ctags,
as written in its man page.

In addition to “yes” and “no”, Universal-ctags takes “never” and “always”.

If “never” is given, absolute paths are used in tags output regardless
of the path representation for input file(s). If “always” is given,
relative paths are always used.
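The four values can be compared with a small sketch. It is a model written in Python, not ctags code: the function output_path is hypothetical, relative paths are assumed to be computed against the directory holding the tags file, and “no” is assumed to keep the path as given.

```python
import posixpath


def output_path(input_path, mode, cwd, tags_dir):
    """Return the path written to the tags file for input_path."""
    absolute = posixpath.normpath(posixpath.join(cwd, input_path))
    relative = posixpath.relpath(absolute, start=tags_dir)
    if mode == "never":
        return absolute
    if mode == "always":
        return relative
    if mode == "yes":
        # Inputs given as absolute paths stay absolute.
        return input_path if posixpath.isabs(input_path) else relative
    if mode == "no":
        # Assumption: the path is kept as it was given.
        return input_path
    raise ValueError(mode)


# "yes" with an absolute input still yields an absolute path.
kept_absolute = output_path("/proj/src/a.c", "yes", "/proj", "/proj/out")
# "always" turns the same input into a relative path.
made_relative = output_path("/proj/src/a.c", "always", "/proj", "/proj/out")
```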

-D emulates the behaviour of the corresponding gcc option:
it defines a C preprocessor macro. All types of macros are supported,
including the ones with parameters and variable arguments.
Stringification, token pasting and recursive macro expansion are also supported.

-I is now simply a backward-compatible syntax to define a
macro with no replacement.

Some examples follow.

$ ctags ... -D IGNORE_THIS ...

With this command line the following C/C++ input

int IGNORE_THIS a;

will be processed as if it was

int a;
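The effect of a macro with no replacement can be imitated with a few lines of Python. This is only a sketch of the idea: the real work is done inside ctags, and the function drop_macros and its simple tokenizer are assumptions made for illustration.

```python
import re


def drop_macros(source, ignored):
    """Remove every identifier listed in `ignored` from a C-like token stream."""
    # Split into identifiers/numbers and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    return [tok for tok in tokens if tok not in ignored]


# IGNORE_THIS is defined with no replacement, so it simply disappears.
tokens = drop_macros("int IGNORE_THIS a;", {"IGNORE_THIS"})
```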

Defining a macro with parameters uses the following syntax:

$ ctags ... -D "foreach(arg)=for(arg;;)" ...

This example defines for(arg;;) as the replacement for foreach(arg).
So the following C/C++ input

foreach(char*p,pointers){}

is processed by the new C/C++ parser as:

for(char*p;;){}

and the p local variable can be extracted.

The previous command line includes quotes since macros generally contain
characters that are treated specially by shells. You may need some escaping.

Token pasting is performed by the ## operator, just like in the normal
C preprocessor.

With the base long flag of the --langdef=<LANG> option, you can define
a subparser for a specified base parser. Combining it with the --kinddef-<LANG>
and --regex-<LANG> options, you can extend an existing parser
without risk of kind conflicts.

Using only --regex-C=... you can capture setpriority.
However, there were concerns about kind conflicts: when introducing
a new kind with --regex-C=..., you cannot use a letter or name already
used in the C parser or in --regex-C=... options specified elsewhere.

You can use a newly defined subparser as a new namespace of kinds.
In addition, you can enable or disable the subparser with the
--languages=[+|-] option:

To prevent generating overly large tags files, a pattern field is
truncated, by default, when its size exceeds 96 bytes. A different
limit can be specified with --pattern-length-limit=N.

An input source file with long lines and multiple tag matches per
line can generate an excessively large tags file with an
unconstrained pattern length. For example, running ctags on a
minified JavaScript source file often exhibits this behaviour.
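The size effect is easy to estimate. The sketch below is plain arithmetic in Python with hypothetical numbers (a 5000-byte line containing 50 tag matches); truncate_pattern is an illustrative helper that cuts at a byte count, not the function ctags uses.

```python
DEFAULT_LIMIT = 96  # default limit stated in the text above


def truncate_pattern(pattern: bytes, limit: int = DEFAULT_LIMIT) -> bytes:
    """Cut a pattern to at most `limit` bytes."""
    return pattern if len(pattern) <= limit else pattern[:limit]


# Hypothetical minified line: 5000 bytes long, containing 50 tag matches.
line = b"x" * 5000
matches = 50

# Every tag on the line repeats the whole line as its pattern.
unconstrained = matches * len(line)
# With the default limit each pattern shrinks to 96 bytes.
limited = matches * len(truncate_pattern(line))
```

In this made-up case the pattern fields alone drop from 250000 bytes to 4800 bytes.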

Traditionally ctags collects the information for locating where a
language object is DEFINED.

In addition Universal-ctags supports reference tags. If the extra-tag
r is enabled, Universal-ctags also collects the information for
locating where a language object is REFERENCED. This feature was
proposed by @shigio in #569 for GNU GLOBAL.

A reference tag may have “role” information representing how it is
referenced. Universal-ctags prints the role information when the r
field is enabled with --fields=+r. If a tag doesn’t have a
specialized role, generic is used as the name of role.

The Reference tag marker field, R, is a specialized GNU GLOBAL
requirement; D is used for the traditional definition tags, and R is
used for the new reference tags. The field can be used only with
--_xformat.

Although the facility for collecting reference tags is implemented,
only a few parsers currently utilize it. All available roles can be
listed with --list-roles:

$ ./ctags --list-roles
#LANGUAGE    KIND(L/N)          NAME         ENABLED  DESCRIPTION
SystemdUnit  u/unit             Requires     on       referred in Requires key
SystemdUnit  u/unit             Wants        on       referred in Wants key
SystemdUnit  u/unit             After        on       referred in After key
SystemdUnit  u/unit             Before       on       referred in Before key
SystemdUnit  u/unit             RequiredBy   on       referred in RequiredBy key
SystemdUnit  u/unit             WantedBy     on       referred in WantedBy key
Yaml         a/anchor           alias        on       alias
DTD          e/element          attOwner     on       attributes owner
Automake     c/condition        branched     on       used for branching
Cobol        S/sourcefile       copied       on       copied in source file
Maven2       g/groupId          dependency   on       dependency
DTD          p/parameterEntity  elementName  on       element names
DTD          p/parameterEntity  condition    on       conditions
LdScript     s/symbol           entrypoint   on       entry points
LdScript     i/inputSection     discarded    on       discarded when linking
...

The first column shows the name of the parser.
The second column shows the letter/name of the kind.
The third column shows the name of the role.
The fourth column shows whether the role is enabled or not.
The fifth column shows the description of the role.
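The five columns can be picked apart with a short Python sketch. It assumes the human-readable layout shown above, where only the description may contain spaces; the names Role and parse_role_line are hypothetical.

```python
from typing import NamedTuple


class Role(NamedTuple):
    language: str
    kind_letter: str
    kind_name: str
    name: str
    enabled: bool
    description: str


def parse_role_line(line: str) -> Role:
    """Split one --list-roles row into its five columns."""
    # The first four columns contain no spaces; the description may.
    language, kind, name, enabled, description = line.split(None, 4)
    # The second column holds the kind as letter/name.
    letter, kind_name = kind.split("/", 1)
    return Role(language, letter, kind_name, name, enabled == "on", description)


role = parse_role_line("SystemdUnit u/unit Requires on referred in Requires key")
```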

Currently ctags does not provide a way to disable a
specified role.

When guessing a proper parser for a given input file, Exuberant-ctags
tests file name patterns AFTER file extensions (e-order).
Universal-ctags does this differently; it tests file name patterns
BEFORE file extensions (u-order).

This incompatible change was introduced to deal with the following
situation: “build.xml” is an input file. The Ant parser declares it
handles a file name pattern “build.xml” and another parser, Foo,
declares it handles a file extension “xml”.

Which parser should be used for parsing the input? The user may want
to use the Ant parser because the pattern it declares is more
specific than the extension Foo declares. However, in e-order, the
other parser, Foo, is chosen.

So Universal-ctags uses the u-order even though it introduces an
incompatibility.
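The difference between the two orders can be shown with a toy model in Python. The parser table, the function names and the order argument are all hypothetical; the real guessing logic in ctags has more strategies than these two tests.

```python
from fnmatch import fnmatchcase
import posixpath

# Hypothetical parser table mirroring the example in the text.
PARSERS = {
    "Ant": {"patterns": ["build.xml"], "extensions": []},
    "Foo": {"patterns": [], "extensions": ["xml"]},
}


def by_pattern(filename):
    """Return the first parser declaring a file name pattern that matches."""
    base = posixpath.basename(filename)
    for parser, spec in PARSERS.items():
        if any(fnmatchcase(base, pat) for pat in spec["patterns"]):
            return parser
    return None


def by_extension(filename):
    """Return the first parser declaring the file's extension."""
    ext = posixpath.splitext(filename)[1].lstrip(".")
    for parser, spec in PARSERS.items():
        if ext in spec["extensions"]:
            return parser
    return None


def guess_parser(filename, order="u"):
    """u-order tests patterns first; e-order tests extensions first."""
    steps = (by_pattern, by_extension) if order == "u" else (by_extension, by_pattern)
    for step in steps:
        parser = step(filename)
        if parser is not None:
            return parser
    return None
```

With this table, guess_parser("build.xml") picks Ant, while guess_parser("build.xml", order="e") picks Foo, which is the situation the change is meant to avoid.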

TAG_KIND_SEPARATOR is a newly introduced pseudo tag. It is not emitted by default.
It is emitted only when --pseudo-tags=+TAG_KIND_SEPARATOR is
given.

This is for describing separators placed between two kinds in a
language.

Tag entries including the separators are emitted when --extras=+q
is given; fully qualified tags contain the separators. The separators
are used in scope information, too.

ctags emits TAG_KIND_SEPARATOR with the following format:

!_TAG_KIND_SEPARATOR!{parser} {sep} /{upper}{lower}/

or

!_TAG_KIND_SEPARATOR!{parser} {sep} /{lower}/

Here {parser} is the name of the language, e.g. PHP.
{lower} is the letter representing the kind of the lower item.
{upper} is the letter representing the kind of the upper item.
{sep} is the separator placed between the upper item and the lower
item.

The format without {upper} is for representing a root separator. The
root separator is used as prefix for an item which has no upper scope.

* given as {upper} is a fallback wild card; if it is given, the
{sep} is used in combination with any upper item and the item
specified with {lower}.

Each backslash character used in {sep} is escaped with an extra
backslash character.
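A tool reading these pseudo tags could decode them along the following lines. This Python sketch assumes that the three columns are separated by tab characters, as in ordinary tag lines, and that {upper} and {lower} are single characters; parse_kind_separator and the sample lines in the usage are hypothetical.

```python
PREFIX = "!_TAG_KIND_SEPARATOR!"


def parse_kind_separator(line):
    """Split a TAG_KIND_SEPARATOR line into (parser, sep, upper, lower)."""
    if not line.startswith(PREFIX):
        raise ValueError("not a TAG_KIND_SEPARATOR pseudo tag")
    # Assumption: the three columns are separated by tabs.
    parser, sep, kinds = line[len(PREFIX):].rstrip("\n").split("\t")
    kinds = kinds.strip("/")
    if len(kinds) == 2:
        # /{upper}{lower}/ form; upper may be the wild card '*'.
        upper, lower = kinds[0], kinds[1]
    elif len(kinds) == 1:
        # /{lower}/ form: a root separator, so there is no upper item.
        upper, lower = None, kinds
    else:
        raise ValueError("unexpected kind spec: " + kinds)
    # Undo the escaping: each backslash in {sep} was doubled.
    sep = sep.replace("\\\\", "\\")
    return parser, sep, upper, lower


# Hypothetical line using the wild card as the upper item.
wild = parse_kind_separator("!_TAG_KIND_SEPARATOR!PHP\t::\t/*c/")
```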

A parser own field only has a long name, no letter. For
enabling/disabling such fields, the name must be passed to
--fields-<LANG>.

e.g. for enabling the sectionMarker field owned by the
reStructuredText parser, use the following command line:

$ ./ctags --fields-reStructuredText=+{sectionMarker} ...

The wild card notation can be used for enabling/disabling parser own
fields, too. The following example enables all fields owned by the
C++ parser.

$ ./ctags --fields-C++='*' ...

* can also be used for specifying languages.

The next example is for enabling end fields for all languages which
have such a field.

$ ./ctags --fields-'*'=+'{end}' ...
...

In this case, using wild card notation to specify the language, not
only fields owned by parsers but also common fields having the name
specified (end in this example) are enabled/disabled.

Using the wild card notation to specify the language is helpful to
avoid incompatibilities between versions of Universal-ctags itself
(SELF INCOMPATIBLY).

In Universal-ctags development, a parser developer may add a new
parser own field for a certain language. Sometimes other developers
then recognize it is meaningful not only for the original language
but also other languages. In this case the field may be promoted to a
common field. Such a promotion will break the command line
compatibility for --fields-<LANG> usage. The wild card for
<LANG> will help in avoiding this unwanted effect of the promotion.

With respect to the tags file format, nothing is changed when
introducing parser own fields; <fieldname>:<value> is used as
before and the name of field owner is never prefixed. The language:
field of the tag identifies the owner.

When disabled the name it’s_ok_to_be_correct is not included in the
tags output. In other words, the name it’s_ok_to_be_correct is
derived from the name it’s ok to be correct when the extra flag is
enabled.

The question is: what are extra tag entries? As far as I know, no one has
answered this explicitly. I have two ideas in Universal-ctags. I
write “ideas”, not “definitions”, here because existing parsers don’t
follow them. They are kept as they are for various reasons, but the
ideas may be a good guide for people who want to write a new parser
or extend an existing parser.

The first idea is that if the name of a tag entry appears in the input
file as is, the entry is NOT an extra. (If you want to control the
inclusion of such entries, the classical --kind-<LANG>=[+|-]... is
what you want.)

Qualified tags, whose inclusion is controlled by --extras=+q, are
explained well by this idea.
Let’s see an example:

Foo and func are in input.py, so they are not extra tags. On the
other hand, Foo.func is not in input.py as is. The name is
generated by ctags as a qualified extra tag entry.
The whitespaceSwapped extra flag of the Robot parser also fits
this idea.

In this example operator+ is in input.cc.
On the other hand, operator + is in the ctags output as a non-extra tag entry.
Note the whitespace between the keyword operator and the + operator.
This is an exception to the first idea.

The second idea is that if the inclusion of a tag cannot be
controlled well with --kind-<LANG>=[+|-]..., the tag may be an
extra.

Function foo of the C language is included only when the F extra flag
is enabled. Both foo and bar are functions. Their inclusion
can be controlled with the f kind of the C language: --kind-C=[+|-]f.

The difference between the static modifier and the implicit extern modifier in
a function definition is handled by the F extra flag.

Basically, the concept of kind is for handling the kinds of language
objects: functions, variables, macros, types, etc. The concept of extra
can handle other aspects like scope (static or extern).

However, a parser developer can take another approach instead of
introducing a parser own extra; one can prepare staticFunction and
exportedFunction as kinds of one’s parser. The second idea is
just a guide; the parser developer must decide on a suitable approach for the
target language.

Anyway, in the second idea, --extras is for controlling inclusion
of tags. If what you want is not about inclusion, --param-<LANG>
can be used as the last resort.

The notation for FORMAT is similar to that employed by printf(3) in
the C language; % represents a slot which is substituted with a
field value when printing. You can specify multiple slots in FORMAT.
Here field means an item listed with the --list-fields option.

The notation of a slot:

%[WIDTH-AND-ADJUSTMENT]FIELD-SPECIFIER

FIELD-SPECIFIER specifies a field whose value is printed.
Short notation and long notation are available. They can be mixed
in a FORMAT. Specifying a field with either notation, one or more
fields are activated internally.

The short notation is just a letter listed in the LETTER column of
the --list-fields output.

The long notation is a name string surrounded by braces ({ and
}). The name string is listed in the NAME column of the output of
the same option. To specify a field owned by a parser, prepend
the parser name to the name string with . as a separator.

Wild card (*) can be used where a parser name is specified. In this
case both common and parser own fields are activated and printed.
If a common field and a parser own field have the same name,
the common field has higher priority.

WIDTH-AND-ADJUSTMENT is a positive or negative number.
The absolute value of the number is used as the width of
the column where a field is printed. The printing is
right adjusted when a positive value is given, and left
adjusted when negative.
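The slot notation can be modelled in a few lines of Python. This sketch follows only the rules stated above (short letters, long names in braces, and a signed width); the function render, its two lookup tables and the sample values are assumptions, and it does not reproduce everything ctags does with a FORMAT.

```python
import re

# %[WIDTH-AND-ADJUSTMENT] followed by {long name} or a single letter.
SLOT = re.compile(r"%(-?\d+)?(?:\{([^}]+)\}|([A-Za-z]))")


def render(fmt, by_letter, by_name):
    """Substitute every slot in fmt with a field value."""
    def substitute(match):
        width, name, letter = match.groups()
        value = by_name[name] if name is not None else by_letter[letter]
        if width is None:
            return value
        n = int(width)
        # Positive width: right adjusted; negative width: left adjusted.
        return value.rjust(n) if n > 0 else value.ljust(-n)
    return SLOT.sub(substitute, fmt)


# Hypothetical field values for one tag entry.
row = render("%-6{name}|%4{line}", {}, {"name": "foo", "line": "12"})
```

Here the name is padded on the right to six columns and the line number is padded on the left to four, so the two notations for adjustment can be seen side by side.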

For a ctags binary that had debugging output enabled in the build config
stage, -D was used for specifying the level of debugging
output. It has been changed to -d. This change is not critical because
the -D option was not described in the ctags.1 man page.

The concept of filtering is inspired by the display filter of
Wireshark. You can specify more complex conditions for searching.
Currently this feature is available only on platforms where
fmemopen is available as part of libc. Filtering in readtags is an
experimental feature.

The syntax of filtering rules is based on the Scheme language, a
variant of Lisp. The language has prefix notation and parentheses.

Before printing an entry from the tags file, readtags evaluates an
expression (S expression or sexp) given as an option argument to
-Q. As the result of the evaluation, readtags gets a value. A false value,
represented as #f, indicates rejection: readtags doesn’t print the entry.

All symbols starting with $ represent a field of a tag entry which
is being tested against the S expression. Most will evaluate as a
string or #f. It evaluates to #f when the field doesn’t exist.
$inherits is evaluated to a list of strings if the entry has an
inherits field. The scope field holds structured data: the kind
and name of the upper scope combined with :. The kind part is
mapped to $scope-kind, and the name part to $scope-name.

$scope-kind and $scope-name can only be used if the input tags
file is generated by ctags with --fields=+Z.

All symbols not prefixed with $ are operators. When using these,
put them at the head (car) of a list. The rest (cdr) of the list is
passed to the operator as arguments. Many of them are also available
in the Scheme language; see the other documents.

prefix?, suffix?, and substr? may only be available in this
implementation. All of them take two strings. The first one
is called the target.

The exception in the above naming convention is the $ operator.
$ is a generic accessor for accessing extension fields.
$ takes one argument: the name of an extension field.
It returns the value of the field as a string if a value
is given, or #f.
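How such an expression rejects or accepts an entry can be imitated with a tiny evaluator. This Python sketch is not the readtags implementation: it supports only a handful of operators, models a tag entry as a dict, represents #f as False, and assumes that the argument of $ is written as a string.

```python
import re


def parse(text):
    """Parse one S expression into nested lists of ("str"|"sym", text) atoms."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        if tok.startswith('"'):
            return ("str", tok[1:-1])
        return ("sym", tok)

    return read()


OPERATORS = {
    # The first argument is the target, as described above.
    "prefix?": lambda target, s: target.startswith(s),
    "suffix?": lambda target, s: target.endswith(s),
    "substr?": lambda target, s: s in target,
    "eq?": lambda a, b: a == b,
    "not": lambda a: a is False,
}


def evaluate(expr, entry):
    """Evaluate expr against one tag entry (a dict of field name -> value)."""
    if isinstance(expr, tuple):
        kind, text = expr
        if kind == "str":
            return text
        if text in ("#f", "#t"):
            return text == "#t"
        if text.startswith("$"):
            # $name, $kind, ...: a missing field evaluates to #f.
            return entry.get(text[1:], False)
        raise ValueError("unknown symbol: " + text)
    head, *rest = expr
    op = head[1]
    if op == "and":
        return all(evaluate(arg, entry) is not False for arg in rest)
    if op == "or":
        return any(evaluate(arg, entry) is not False for arg in rest)
    args = [evaluate(arg, entry) for arg in rest]
    if op == "$":
        # Generic accessor: the argument names an extension field.
        return entry.get(args[0], False)
    if op != "not" and any(arg is False for arg in args):
        # Simplification of this sketch: #f arguments make the test fail.
        return False
    return OPERATORS[op](*args)


# Hypothetical tag entry and a filter that keeps it.
entry = {"name": "setpriority", "kind": "function", "signature": "(int)"}
keep = evaluate(parse('(and (eq? $kind "function") (prefix? $name "set"))'), entry)
```

An entry is printed only when the value is not False, mirroring the role of #f in the description above.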