= Proposal: Add 'text' to the Haskell Platform =
Proposal Author: Don Stewart
Maintainer: Bryan O'Sullivan (submitted with his approval)
== Introduction ==
This is a proposal for the 'text' package to be included in the next
major release of the Haskell Platform.
Everyone is invited to review this proposal, following the standard
procedure for proposing and reviewing packages.
http://trac.haskell.org/haskell-platform/wiki/AddingPackages
Review comments should be sent to the libraries mailing list by
October 1 so that we have time to discuss and resolve issues
before the final deadline for a call for consensus in early November.
http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable
== Credits ==
Package author and maintainer: Bryan O'Sullivan. The package was originally
written by Tom Harper, and is based on the ByteString and Vector (fusion) packages.
The following individuals contributed to the review process: Don
Stewart, Johan Tibell
== Abstract ==
The 'text' package provides an efficient packed, immutable Unicode text type
(both strict and lazy), with a powerful loop fusion optimization framework.
The 'Text' type represents Unicode character strings in a time- and
space-efficient manner. This package provides text processing
capabilities that are optimized for performance-critical use, both
in terms of large data quantities and high speed.
The 'Text' type is character-encoding aware and provides type-safe case
conversion via whole-string case conversion functions. It also
provides a range of functions for converting Text values to and from
'ByteStrings', using several standard encodings (see the 'text-icu'
package for a much larger variety of encoding functions).
Efficient, locale-sensitive text IO is also supported.
This module is intended to be imported qualified, to avoid name
clashes with Prelude functions, e.g.
{{{
import qualified Data.Text as T
}}}
Documentation and tarball from the hackage page:
{{{
http://hackage.haskell.org/package/text
}}}
Development repo:
{{{
darcs get http://code.haskell.org/text/
}}}
All [http://trac.haskell.org/haskell-platform/wiki/AddingPackages#Packagerequirements package requirements] are met.
== Rationale ==
While Haskell's Char type is capable of representing Unicode code points, the
String sequence of such Chars has some drawbacks that prevent its general
use:
1. unicode-unaware case conversion (map toUpper is an unsafe case conversion)
2. the representation is space inefficient.
3. the data structure is element-level lazy, whereas a number of
applications require some level of additional strictness
An intermediate solution to these was via 'Data.ByteString' (an
efficient byte sequence type, that addresses points 2 and 3), which,
when used in conjunction with utf8-string, provides very simple
non-latin1 encoding support (though with significant drawbacks in terms
of locale and encoding range).
The 'text' package addresses these shortcomings in a number of ways:
1. support for whole-string case conversion (thus, type-correct Unicode
transformations)
2. a space and time efficient representation, based on unboxed Word16
arrays
3. either fully strict, or chunk-level lazy data types (in the style of
Data.ByteString)
4. full support for locale-sensitive, encoding-aware IO.
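As a small illustration of point 1, the sketch below contrasts per-character case conversion on a String with whole-string conversion on a Text. The binding names (`street`, `perChar`, `wholeString`) are made up for this example; only `T.pack` and `T.toUpper` come from the package.
{{{
import Data.Char (toUpper)
import qualified Data.Text as T

-- A German word containing U+00DF (sharp s), written as a numeric
-- escape so the example does not depend on the source file encoding.
street :: String
street = "stra\223e"

-- Per-character conversion: U+00DF has no single-character upper case,
-- so it is left unchanged and the result still has 6 characters.
perChar :: String
perChar = map toUpper street

-- Whole-string conversion: U+00DF expands to "SS", so the result is
-- "STRASSE" (7 characters). A Char -> Char function cannot express this.
wholeString :: T.Text
wholeString = T.toUpper (T.pack street)
}}}
The length change is the reason case conversion is offered on whole strings rather than as a map over characters.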
The 'text' library has rapidly become popular for a number of
applications, and is used by more than 50 other Hackage packages. As of
Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular).
It is particularly widely used in web programming. It is used by:
* the blaze html pretty printing library
* the hstringtemplate file templating library
* *all* popular web frameworks: happstack, snap, salvia and yesod
* the hexpat and libxml xml parsers
The design is based on experience from Data.Vector and Data.ByteString:
* the underlying type is based on unpinned, packed arrays on the Haskell heap
with an ST interface for memory effects.
* pipelines of operations are optimized via conversion to and from the
'stream' abstraction[1]
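To give a feel for the 'stream' abstraction, here is a toy model over ordinary lists, following the presentation in [1]. It is a sketch only and is not the package's internal code: the names `Step`, `Stream`, `stream`, `unstream`, `mapS`, `filterS` and `shout` are chosen for this example, and the real implementation works on packed arrays and relies on rewrite rules to remove intermediate conversions.
{{{
{-# LANGUAGE ExistentialQuantification #-}
import Data.Char (isAlpha, toUpper)

-- One step of a stream: finished, no element this step, or one element.
data Step s a = Done | Skip s | Yield a s

-- A stream is a stepper function plus a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Convert a list into a stream (the state is the remaining list).
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs

-- Run a stream back into a list.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'

-- map and filter are non-recursive: they only wrap the stepper.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'

-- A pipeline: one conversion in, one conversion out, and a single
-- traversal in between.
shout :: String -> String
shout = unstream . mapS toUpper . filterS isAlpha . stream
}}}
Because the combinators are non-recursive, a compiler can inline them into one loop; the single recursive function is `unstream`.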
A large testsuite, with [http://code.haskell.org/~dons/tests/text/ coverage data], is provided.
== The API ==
The API is broken into several logical pieces, which are
self-explanatory:
* combinators for operating on strict, abstract 'text's.
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
* an equivalent API for chunk-level lazy 'text's.
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy.html
* encoding transformations, to and from bytestrings:
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Encoding.html
* support for conversion to Ptr Word16:
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Foreign.html
* locale-aware IO layer:
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy-IO.html
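A minimal sketch of how the encoding and IO modules fit together with the core type is shown below. The binding names (`word`, `utf8`, `roundTrip`, `printWord`) are hypothetical; `encodeUtf8`, `decodeUtf8` and `putStrLn` are the functions from Data.Text.Encoding and Data.Text.IO.
{{{
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- Five characters; U+00EF (i with diaeresis) is written as an escape.
word :: T.Text
word = T.pack "na\239ve"

-- Encode to UTF-8. U+00EF needs two bytes, so the result is 6 bytes long
-- even though the text is 5 characters long.
utf8 :: B.ByteString
utf8 = TE.encodeUtf8 word

-- Decode back to Text. decodeUtf8 throws on invalid input, so it should
-- only be applied to bytes known to be valid UTF-8, as here.
roundTrip :: T.Text
roundTrip = TE.decodeUtf8 utf8

-- Write the text to stdout using the handle's current encoding.
printWord :: IO ()
printWord = TIO.putStrLn word
}}}
The point of the split is that a Text has no encoding of its own; an encoding is chosen only at the boundary, either explicitly (Data.Text.Encoding) or from the handle (Data.Text.IO).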
== Design decisions ==
* IO and pure combinators are in separate modules.
* Both a fully strict and a partially strict (chunk-level lazy) type are provided.
* The underlying optimization framework is stream fusion (not foldr/build), and is hidden from the user.
* Unpinned arrays are used, to prevent fragmentation.
* Large numbers of additional encodings are delegated to the text-icu package.
* An 'IsString' instance is provided for String literals.
* The implementation is OS and architecture neutral (portable).
* The implementation uses a number of language extensions:
{{{
CPP
MagicHash
UnboxedTuples
BangPatterns
Rank2Types
RecordWildCards
ScopedTypeVariables
ExistentialQuantification
DeriveDataTypeable
}}}
* The implementation is entirely Haskell (no additional C code or libraries).
* The package provides a QuickCheck/HUnit testsuite, and coverage data.
* The package adds no new dependencies to the HP.
* The package builds with Cabal's Simple build type.
* There is no existing functionality for packed unicode text in the HP.
* The package has complexity annotations.
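The 'IsString' instance mentioned in the list above can be exercised as follows. This is a sketch that assumes GHC's OverloadedStrings extension is enabled in the client module; the binding names are made up for the example.
{{{
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- With OverloadedStrings, the literal is converted through the
-- IsString instance, so no explicit T.pack is needed.
greeting :: T.Text
greeting = "hello, world"

-- The same value built explicitly, which works without the extension.
greeting' :: T.Text
greeting' = T.pack "hello, world"

-- Literals can also be passed directly to functions expecting Text.
loud :: T.Text
loud = T.toUpper "hello, world"
}}}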
== Open issues ==
1. The `text-icu` package is not part of this proposal, as adding it would make the platform depend on the ICU C library. ''This is not a blocker.''
2. Both the `text` package and the `base` package [http://www.haskell.org/pipermail/libraries/2010-September/014196.html provide Unicode encoding/decoding functionality]. Perhaps some of this functionality could be merged. ''This cannot be achieved until the base library makes some types non-abstract. This is not a blocker.''
3. Naming inconsistencies between bytestring, text and list. Some functions [http://www.haskell.org/pipermail/libraries/2010-October/014614.html have similar names to functions] in the [http://hackage.haskell.org/package/bytestring bytestring] package but have different types (other than `ByteString` vs `Text`.) Some functions have the same type but different names.
4. Do we need both a strict and lazy version of `Text`? The strict version needs one less indirection, can be unpacked in function arguments and takes less space when stored in data types. ''The performance difference is substantial, and the non-strict version can stream large quantities of data in a small footprint in a way not possible with the strict kind. Not a blocker.''
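For readers weighing issue 4, the relationship between the two types can be sketched as below: a lazy Text behaves like a lazy list of strict chunks. The binding names are hypothetical; `fromChunks`, `toChunks` and `length` are from Data.Text.Lazy.
{{{
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Two strict chunks, as they might arrive from successive reads.
chunks :: [T.Text]
chunks = map T.pack ["Haskell ", "Platform"]

-- A lazy Text is assembled from strict chunks; chunks are only
-- demanded as the consumer walks the value.
lazyText :: TL.Text
lazyText = TL.fromChunks chunks

-- Collapsing to a single strict value copies everything into one array.
strictText :: T.Text
strictText = T.concat (TL.toChunks lazyText)

-- The lazy length is an Int64, since a lazy Text may exceed memory size.
lazyLength :: Int
lazyLength = fromIntegral (TL.length lazyText)
}}}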
=== On the naming issue ===
* One proposal on how to [http://www.haskell.org/pipermail/libraries/2010-October/014659.html fix the names], and the [http://www.haskell.org/pipermail/libraries/2010-October/014674.html author's response]
* A call for further discussion on the [http://www.haskell.org/pipermail/libraries/2010-November/014901.html name/type matching issue].
The package maintainer has [http://www.haskell.org/pipermail/libraries/2010-November/015080.html proposed] an updated API. The substring functions are now named:
{{{
breakOn :: Text -> Text -> (Text, Text)
breakOnEnd :: Text -> Text -> (Text, Text)
breakOnAll :: Text -> Text -> [(Text, Text)]
splitOn :: Text -> Text -> [Text]
}}}
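A short sketch of how these substring functions behave, assuming a version of 'text' that includes the renamed API. In each case the first argument is the separator and the second is the text to search. The sample values and binding names are illustrative only.
{{{
import qualified Data.Text as T

path, sep :: T.Text
path = T.pack "a::b::c"
sep  = T.pack "::"

-- Split at the first occurrence; the separator stays on the right.
firstBreak :: (T.Text, T.Text)
firstBreak = T.breakOn sep path      -- ("a", "::b::c")

-- Split at the last occurrence; the separator stays on the left.
lastBreak :: (T.Text, T.Text)
lastBreak = T.breakOnEnd sep path    -- ("a::b::", "c")

-- One pair per occurrence of the separator.
allBreaks :: [(T.Text, T.Text)]
allBreaks = T.breakOnAll sep path    -- [("a","::b::c"), ("a::b","::c")]

-- Drop the separators and keep the fields.
fields :: [T.Text]
fields = T.splitOn sep path          -- ["a", "b", "c"]
}}}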
The character predicate functions now match the List names:
{{{
break :: (Char -> Bool) -> Text -> (Text, Text)
span :: (Char -> Bool) -> Text -> (Text, Text)
partition :: (Char -> Bool) -> Text -> (Text, Text)
find :: (Char -> Bool) -> Text -> Maybe Char
split :: (Char -> Bool) -> Text -> [Text]
}}}
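The predicate-based functions then read the same way as their list counterparts. The following sketch uses made-up sample data and binding names, and assumes a version of 'text' with the names listed above.
{{{
import Data.Char (isDigit)
import qualified Data.Text as T

sample :: T.Text
sample = T.pack "123abc456"

-- Longest prefix satisfying the predicate, and the rest.
digitsThenRest :: (T.Text, T.Text)
digitsThenRest = T.span isDigit sample             -- ("123", "abc456")

-- Longest prefix NOT satisfying the predicate, and the rest.
lettersThenRest :: (T.Text, T.Text)
lettersThenRest = T.break isDigit (T.pack "abc456") -- ("abc", "456")

-- All matching characters, and all non-matching characters.
digitsAndOthers :: (T.Text, T.Text)
digitsAndOthers = T.partition isDigit sample       -- ("123456", "abc")

-- First matching character, if any.
firstDigit :: Maybe Char
firstDigit = T.find isDigit (T.pack "abc4")        -- Just '4'

-- Split on every character satisfying the predicate; adjacent
-- separators produce empty fields.
commaFields :: [T.Text]
commaFields = T.split (== ',') (T.pack "a,b,,c")   -- ["a", "b", "", "c"]
}}}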
The `count` function remains unchanged, but there is the suggestion that the `bytestring` version of `count` could be generalised instead:
{{{
count :: Text -> Text -> Int
}}}
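The difference between the two `count` functions can be seen in a small sketch. The binding names are hypothetical, and the bytestring side uses Data.ByteString.Char8 only to keep the example readable.
{{{
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- text: counts occurrences of a substring.
textCount :: Int
textCount = T.count (T.pack "an") (T.pack "banana")   -- 2

-- bytestring: counts occurrences of a single element.
byteCount :: Int
byteCount = BC.count 'a' (BC.pack "banana")           -- 3
}}}
The same name therefore takes a whole `Text` as its first argument in one package and a single element in the other, which is the kind of mismatch discussed in open issue 3.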
''History'': updated [http://www.haskell.org/pipermail/libraries/2010-October/014532.html Oct 4, by johan], Nov 6 by dons and [http://www.haskell.org/pipermail/libraries/2010-November/015080.html Nov 16 by duncan].
== Notes ==
The implementation consists of 30 modules, and relies on cabal's package
hiding mechanism to expose only 5 modules. The implementation is around
8000 lines of text total.
The public modules expose none of these (?).
The Python standard library provides both a string and a unicode
sequence type. These are somewhat analogous to the
ByteString/String/Text split.
== References ==
[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.