| |
| |
| .. function:: register(search_function) |
| |
| Register a codec search function. Search functions are expected to take one |
| argument, the encoding name in all lower case letters, and return a |
| :class:`CodecInfo` object having the following attributes: |
| |
n | * ``name`` The name of the encoding; |
n | * ``name`` The name of the encoding; |
| |
n | * ``encoder`` The stateless encoding function; |
n | * ``encode`` The stateless encoding function; |
| |
n | * ``decoder`` The stateless decoding function; |
n | * ``decode`` The stateless decoding function; |
| |
n | * ``incrementalencoder`` An incremental encoder class or factory function; |
n | * ``incrementalencoder`` An incremental encoder class or factory function; |
| |
n | * ``incrementaldecoder`` An incremental decoder class or factory function; |
n | * ``incrementaldecoder`` An incremental decoder class or factory function; |
| |
n | * ``streamwriter`` A stream writer class or factory function; |
n | * ``streamwriter`` A stream writer class or factory function; |
| |
n | * ``streamreader`` A stream reader class or factory function. |
n | * ``streamreader`` A stream reader class or factory function. |
| |
| The various functions or classes take the following arguments: |
| |
n | *encoder* and *decoder*: These must be functions or methods which have the same |
n | *encode* and *decode*: These must be functions or methods which have the same |
| interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see |
| Codec Interface). The functions/methods are expected to work in a stateless |
| mode. |
| |
n | *incrementalencoder* and *incrementalencoder*: These have to be factory |
n | *incrementalencoder* and *incrementaldecoder*: These have to be factory |
| functions providing the following interface: |
| |
| ``factory(errors='strict')`` |
| |
| The factory functions must return objects providing the interfaces defined by |
n | the base classes :class:`IncrementalEncoder` and :class:`IncrementalEncoder`, |
n | the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`, |
| respectively. Incremental codecs can maintain state. |
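
As a minimal sketch of the interface described above, the following search function serves a made-up codec name by reusing the callables of the built-in UTF-8 codec. It assumes Python 3, where text is ``str`` and encoded data is ``bytes``; the codec name ``demoutf8`` and the function ``demo_search`` are hypothetical and exist only for illustration.

```python
import codecs


def demo_search(name):
    """Search function: receives the lower-cased encoding name."""
    if name != "demoutf8":  # hypothetical codec name, not a real encoding
        return None  # tell the registry to try the next search function
    utf8 = codecs.lookup("utf-8")
    # Reuse the UTF-8 callables; a real codec would supply its own.
    return codecs.CodecInfo(
        name="demoutf8",
        encode=utf8.encode,                          # stateless encoder
        decode=utf8.decode,                          # stateless decoder
        incrementalencoder=utf8.incrementalencoder,  # factory(errors='strict')
        incrementaldecoder=utf8.incrementaldecoder,  # factory(errors='strict')
        streamwriter=utf8.streamwriter,              # factory(stream, errors='strict')
        streamreader=utf8.streamreader,              # factory(stream, errors='strict')
    )


codecs.register(demo_search)

encoded = "caf\u00e9".encode("demoutf8")
decoded = encoded.decode("demoutf8")
```

Returning ``None`` for unknown names lets the registry fall through to other registered search functions.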
| |
| *streamreader* and *streamwriter*: These have to be factory functions providing |
| the following interface: |
| |
| ``factory(stream, errors='strict')`` |
| |
| The factory functions must return objects providing the interfaces defined by |
| Implements the ``replace`` error handling. |
| |
| |
| .. function:: ignore_errors(exception) |
| |
| Implements the ``ignore`` error handling. |
| |
| |
n | .. function:: xmlcharrefreplace_errors_errors(exception) |
n | .. function:: xmlcharrefreplace_errors(exception) |
| |
| Implements the ``xmlcharrefreplace`` error handling. |
| |
| |
n | .. function:: backslashreplace_errors_errors(exception) |
n | .. function:: backslashreplace_errors(exception) |
| |
| Implements the ``backslashreplace`` error handling. |
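
The effect of these handlers is easiest to see by encoding one unencodable character with each of them. The sketch below assumes Python 3 and uses the ``ascii`` codec purely as a convenient way to provoke an encoding error.

```python
text = "caf\u00e9"  # the last character has no ASCII representation

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", name) for name in handlers}
# results["ignore"]            == b"caf"
# results["replace"]           == b"caf?"
# results["xmlcharrefreplace"] == b"caf&#233;"
# results["backslashreplace"]  == b"caf\\xe9"

# The default 'strict' handler raises a ValueError subclass instead.
try:
    text.encode("ascii", "strict")
    strict_error = None
except ValueError as exc:
    strict_error = exc
```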
| |
| To simplify working with encoded files or streams, the module also defines these |
| utility functions: |
| |
| |
| .. function:: open(filename, mode[, encoding[, errors[, buffering]]]) |
| |
| Open an encoded file using the given *mode* and return a wrapped version |
n | providing transparent encoding/decoding. |
n | providing transparent encoding/decoding. The default file mode is ``'r'`` |
| meaning to open the file in read mode. |
| |
| .. note:: |
| |
| The wrapped version will only accept the object format defined by the codecs, |
| i.e. Unicode objects for most built-in codecs. Output is also codec-dependent |
| and will usually be Unicode as well. |
n | |
| .. note:: |
| |
| Files are always opened in binary mode, even if no binary mode was |
| specified. This is done to avoid data loss due to encodings using 8-bit |
| values. This means that no automatic conversion of ``'\n'`` is done |
| on reading and writing. |
| |
| *encoding* specifies the encoding which is to be used for the file. |
| |
| *errors* may be given to define the error handling. It defaults to ``'strict'`` |
| which causes a :exc:`ValueError` to be raised in case an encoding error occurs. |
| |
| *buffering* has the same meaning as for the built-in :func:`open` function. It |
| defaults to line buffered. |
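
A short round trip illustrates the wrapped file object and the binary-mode note. This is a sketch assuming Python 3; the temporary file name ``demo.txt`` is arbitrary.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # arbitrary scratch file

# Writing: text goes in, UTF-8 bytes end up on disk.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9\n")

# Reading: bytes on disk come back as decoded text.
with codecs.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# The underlying file was opened in binary mode, so '\n' was written
# unchanged on every platform.
with open(path, "rb") as f:
    raw = f.read()
```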
| |
| *errors* may be given to define the error handling. It defaults to ``'strict'``, |
| which causes :exc:`ValueError` to be raised in case an encoding error occurs. |
| |
| |
| .. function:: iterencode(iterable, encoding[, errors]) |
| |
| Uses an incremental encoder to iteratively encode the input provided by |
n | *iterable*. This function is a generator. *errors* (as well as any other keyword |
n | *iterable*. This function is a :term:`generator`. *errors* (as well as any |
| argument) is passed through to the incremental encoder. |
| other keyword argument) is passed through to the incremental encoder. |
| |
| .. versionadded:: 2.5 |
| |
| |
| .. function:: iterdecode(iterable, encoding[, errors]) |
| |
| Uses an incremental decoder to iteratively decode the input provided by |
n | *iterable*. This function is a generator. *errors* (as well as any other keyword |
n | *iterable*. This function is a :term:`generator`. *errors* (as well as any |
| argument) is passed through to the incremental encoder. |
| other keyword argument) is passed through to the incremental decoder. |
| |
| .. versionadded:: 2.5 |
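
Both generators can be sketched with a few small chunks. The example assumes Python 3; note that the byte chunks fed to ``iterdecode`` deliberately split a two-byte UTF-8 sequence, which the incremental decoder buffers across calls.

```python
import codecs

# Encode text chunk by chunk.
text_chunks = ["ca", "f\u00e9", "!"]
encoded_chunks = list(codecs.iterencode(text_chunks, "utf-8"))

# Decode byte chunks whose boundary falls inside a multi-byte sequence.
byte_chunks = [b"caf\xc3", b"\xa9!"]
decoded_chunks = list(codecs.iterdecode(byte_chunks, "utf-8"))
decoded_text = "".join(decoded_chunks)
```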
| |
| The module also provides the following constants which are useful for reading |
| and writing to platform dependent files: |
| |
| |
| .. data:: BOM |
| |
| All incremental encoders must provide this constructor interface. They are free |
| to add additional keyword arguments, but only the ones defined here are used by |
| the Python codec registry. |
| |
| The :class:`IncrementalEncoder` may implement different error handling schemes |
| by providing the *errors* keyword argument. These parameters are predefined: |
| |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
| |
n | * ``'ignore'`` Ignore the character and continue with the next. |
n | * ``'ignore'`` Ignore the character and continue with the next. |
| |
n | * ``'replace'`` Replace with a suitable replacement character |
n | * ``'replace'`` Replace with a suitable replacement character |
| |
n | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference |
n | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference |
| |
n | * ``'backslashreplace'`` Replace with backslashed escape sequences. |
n | * ``'backslashreplace'`` Replace with backslashed escape sequences. |
| |
| The *errors* argument will be assigned to an attribute of the same name. |
| Assigning to this attribute makes it possible to switch between different error |
| handling strategies during the lifetime of the :class:`IncrementalEncoder` |
| object. |
| |
| The set of allowed values for the *errors* argument can be extended with |
| :func:`register_error`. |
| |
| |
n | .. method:: IncrementalEncoder.encode(object[, final]) |
n | .. method:: encode(object[, final]) |
| |
n | Encodes *object* (taking the current state of the encoder into account) and |
n | Encodes *object* (taking the current state of the encoder into account) |
| returns the resulting encoded object. If this is the last call to :meth:`encode` |
| and returns the resulting encoded object. If this is the last call to |
| *final* must be true (the default is false). |
| :meth:`encode` *final* must be true (the default is false). |
| |
| |
n | .. method:: IncrementalEncoder.reset() |
n | .. method:: reset() |
| |
n | Reset the encoder to the initial state. |
n | Reset the encoder to the initial state. |
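
The state kept by an incremental encoder can be observed with the ``utf-8-sig`` codec, which emits its signature only at the start of the output. This is a sketch assuming Python 3; the choice of ``utf-8-sig`` is only for illustration.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")(errors="strict")

first = encoder.encode("a")         # first call: signature plus data
second = encoder.encode("b", True)  # later call: data only; final=True

encoder.reset()                     # back to the initial state
after_reset = encoder.encode("c", True)  # signature is written again

current_errors = encoder.errors     # the *errors* argument is kept as an attribute
```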
| |
| |
| .. _incremental-decoder-objects: |
| |
| IncrementalDecoder Objects |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The :class:`IncrementalDecoder` class is used for decoding an input in multiple |
| |
| All incremental decoders must provide this constructor interface. They are free |
| to add additional keyword arguments, but only the ones defined here are used by |
| the Python codec registry. |
| |
| The :class:`IncrementalDecoder` may implement different error handling schemes |
| by providing the *errors* keyword argument. These parameters are predefined: |
| |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
| |
n | * ``'ignore'`` Ignore the character and continue with the next. |
n | * ``'ignore'`` Ignore the character and continue with the next. |
| |
n | * ``'replace'`` Replace with a suitable replacement character. |
n | * ``'replace'`` Replace with a suitable replacement character. |
| |
| The *errors* argument will be assigned to an attribute of the same name. |
| Assigning to this attribute makes it possible to switch between different error |
n | handling strategies during the lifetime of the :class:`IncrementalEncoder` |
n | handling strategies during the lifetime of the :class:`IncrementalDecoder` |
| object. |
| |
| The set of allowed values for the *errors* argument can be extended with |
| :func:`register_error`. |
| |
| |
n | .. method:: IncrementalDecoder.decode(object[, final]) |
n | .. method:: decode(object[, final]) |
| |
n | Decodes *object* (taking the current state of the decoder into account) and |
n | Decodes *object* (taking the current state of the decoder into account) |
| returns the resulting decoded object. If this is the last call to :meth:`decode` |
| and returns the resulting decoded object. If this is the last call to |
| *final* must be true (the default is false). If *final* is true the decoder must |
| :meth:`decode` *final* must be true (the default is false). If *final* is |
| decode the input completely and must flush all buffers. If this isn't possible |
| true the decoder must decode the input completely and must flush all |
| (e.g. because of incomplete byte sequences at the end of the input) it must |
| buffers. If this isn't possible (e.g. because of incomplete byte sequences |
| initiate error handling just like in the stateless case (which might raise an |
| at the end of the input) it must initiate error handling just like in the |
| exception). |
| stateless case (which might raise an exception). |
| |
| |
n | .. method:: IncrementalDecoder.reset() |
n | .. method:: reset() |
| |
n | Reset the decoder to the initial state. |
n | Reset the decoder to the initial state. |
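
The role of *final* and of the ``errors`` attribute can be sketched with the UTF-8 decoder. The example assumes Python 3 and feeds the two bytes of one character in separate calls.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

part1 = decoder.decode(b"\xc3")        # incomplete sequence is buffered
part2 = decoder.decode(b"\xa9", True)  # sequence completed on the last call

# With final=True an incomplete trailing sequence triggers error handling.
decoder.reset()
try:
    decoder.decode(b"\xc3", True)
    strict_failed = False
except UnicodeDecodeError:             # a subclass of ValueError
    strict_failed = True

# Switching the strategy during the decoder's lifetime.
decoder.reset()
decoder.errors = "replace"
replaced = decoder.decode(b"\xc3", True)
```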
| |
| |
| The :class:`StreamWriter` and :class:`StreamReader` classes provide generic |
| working interfaces which can be used to implement new encoding submodules very |
| easily. See :mod:`encodings.utf_8` for an example of how this is done. |
| |
| |
| .. _stream-writer-objects: |
| |
| additional keyword arguments, but only the ones defined here are used by the |
| Python codec registry. |
| |
| *stream* must be a file-like object open for writing binary data. |
| |
| The :class:`StreamWriter` may implement different error handling schemes by |
| providing the *errors* keyword argument. These parameters are predefined: |
| |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
| |
n | * ``'ignore'`` Ignore the character and continue with the next. |
n | * ``'ignore'`` Ignore the character and continue with the next. |
| |
n | * ``'replace'`` Replace with a suitable replacement character |
n | * ``'replace'`` Replace with a suitable replacement character |
| |
n | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference |
n | * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference |
| |
n | * ``'backslashreplace'`` Replace with backslashed escape sequences. |
n | * ``'backslashreplace'`` Replace with backslashed escape sequences. |
| |
| The *errors* argument will be assigned to an attribute of the same name. |
| Assigning to this attribute makes it possible to switch between different error |
| handling strategies during the lifetime of the :class:`StreamWriter` object. |
| |
| The set of allowed values for the *errors* argument can be extended with |
| :func:`register_error`. |
| |
| |
n | .. method:: StreamWriter.write(object) |
n | .. method:: write(object) |
| |
n | Writes the object's contents encoded to the stream. |
n | Writes the object's contents encoded to the stream. |
| |
| |
n | .. method:: StreamWriter.writelines(list) |
n | .. method:: writelines(list) |
| |
n | Writes the concatenated list of strings to the stream (possibly by reusing the |
n | Writes the concatenated list of strings to the stream (possibly by reusing |
| :meth:`write` method). |
| the :meth:`write` method). |
| |
| |
n | .. method:: StreamWriter.reset() |
n | .. method:: reset() |
| |
n | Flushes and resets the codec buffers used for keeping state. |
n | Flushes and resets the codec buffers used for keeping state. |
| |
n | Calling this method should ensure that the data on the output is put into a |
n | Calling this method should ensure that the data on the output is put into |
| clean state that allows appending of new fresh data without having to rescan the |
| a clean state that allows appending of new fresh data without having to |
| whole stream to recover state. |
| rescan the whole stream to recover state. |
| |
| |
| In addition to the above methods, the :class:`StreamWriter` must also inherit |
| all other methods and attributes from the underlying stream. |
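
A stream writer can be tried out against an in-memory binary stream. This is a sketch assuming Python 3, with ``io.BytesIO`` standing in for a file opened for writing binary data.

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for a binary output file
writer = codecs.getwriter("utf-8")(raw, errors="strict")

writer.write("caf\u00e9")
writer.writelines(["\n", "ok"])          # concatenated, then written

data = raw.getvalue()

# Attributes not defined by the writer are taken from the wrapped stream.
position = writer.tell()
```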
| |
| |
| .. _stream-reader-objects: |
| |
| StreamReader Objects |
| additional keyword arguments, but only the ones defined here are used by the |
| Python codec registry. |
| |
| *stream* must be a file-like object open for reading (binary) data. |
| |
| The :class:`StreamReader` may implement different error handling schemes by |
| providing the *errors* keyword argument. These parameters are predefined: |
| |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
n | * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default. |
| |
n | * ``'ignore'`` Ignore the character and continue with the next. |
n | * ``'ignore'`` Ignore the character and continue with the next. |
| |
n | * ``'replace'`` Replace with a suitable replacement character. |
n | * ``'replace'`` Replace with a suitable replacement character. |
| |
| The *errors* argument will be assigned to an attribute of the same name. |
| Assigning to this attribute makes it possible to switch between different error |
| handling strategies during the lifetime of the :class:`StreamReader` object. |
| |
| The set of allowed values for the *errors* argument can be extended with |
| :func:`register_error`. |
| |
| |
n | .. method:: StreamReader.read([size[, chars, [firstline]]]) |
n | .. method:: read([size[, chars, [firstline]]]) |
| |
n | Decodes data from the stream and returns the resulting object. |
n | Decodes data from the stream and returns the resulting object. |
| |
n | *chars* indicates the number of characters to read from the stream. :func:`read` |
n | *chars* indicates the number of characters to read from the |
| will never return more than *chars* characters, but it might return less, if |
| stream. :func:`read` will never return more than *chars* characters, but |
| there are not enough characters available. |
| it might return less, if there are not enough characters available. |
| |
n | *size* indicates the approximate maximum number of bytes to read from the stream |
n | *size* indicates the approximate maximum number of bytes to read from the |
| for decoding purposes. The decoder can modify this setting as appropriate. The |
| stream for decoding purposes. The decoder can modify this setting as |
| default value -1 indicates to read and decode as much as possible. *size* is |
| appropriate. The default value -1 indicates to read and decode as much as |
| intended to prevent having to decode huge files in one step. |
| possible. *size* is intended to prevent having to decode huge files in |
| one step. |
| |
n | *firstline* indicates that it would be sufficient to only return the first line, |
n | *firstline* indicates that it would be sufficient to only return the first |
| if there are decoding errors on later lines. |
| line, if there are decoding errors on later lines. |
| |
n | The method should use a greedy read strategy meaning that it should read as much |
n | The method should use a greedy read strategy meaning that it should read |
| data as is allowed within the definition of the encoding and the given size, |
| as much data as is allowed within the definition of the encoding and the |
| e.g. if optional encoding endings or state markers are available on the stream, |
| given size, e.g. if optional encoding endings or state markers are |
| these should be read too. |
| available on the stream, these should be read too. |
| |
n | .. versionchanged:: 2.4 |
n | .. versionchanged:: 2.4 |
| *chars* argument added. |
| *chars* argument added. |
| |
n | .. versionchanged:: 2.4.2 |
n | .. versionchanged:: 2.4.2 |
| *firstline* argument added. |
| *firstline* argument added. |
| |
| |
n | .. method:: StreamReader.readline([size[, keepends]]) |
n | .. method:: readline([size[, keepends]]) |
| |
n | Read one line from the input stream and return the decoded data. |
n | Read one line from the input stream and return the decoded data. |
| |
n | *size*, if given, is passed as size argument to the stream's :meth:`readline` |
n | *size*, if given, is passed as size argument to the stream's |
| method. |
| :meth:`readline` method. |
| |
n | If *keepends* is false line-endings will be stripped from the lines returned. |
n | If *keepends* is false line-endings will be stripped from the lines |
| returned. |
| |
n | .. versionchanged:: 2.4 |
n | .. versionchanged:: 2.4 |
| *keepends* argument added. |
| *keepends* argument added. |
| |
| |
n | .. method:: StreamReader.readlines([sizehint[, keepends]]) |
n | .. method:: readlines([sizehint[, keepends]]) |
| |
n | Read all lines available on the input stream and return them as a list of lines. |
n | Read all lines available on the input stream and return them as a list of |
| lines. |
| |
n | Line-endings are implemented using the codec's decoder method and are included |
n | Line-endings are implemented using the codec's decoder method and are |
| in the list entries if *keepends* is true. |
| included in the list entries if *keepends* is true. |
| |
n | *sizehint*, if given, is passed as the *size* argument to the stream's |
n | *sizehint*, if given, is passed as the *size* argument to the stream's |
| :meth:`read` method. |
| :meth:`read` method. |
| |
| |
n | .. method:: StreamReader.reset() |
n | .. method:: reset() |
| |
n | Resets the codec buffers used for keeping state. |
n | Resets the codec buffers used for keeping state. |
| |
n | Note that no stream repositioning should take place. This method is primarily |
n | Note that no stream repositioning should take place. This method is |
| intended to be able to recover from decoding errors. |
| primarily intended to be able to recover from decoding errors. |
| |
| |
| In addition to the above methods, the :class:`StreamReader` must also inherit |
| all other methods and attributes from the underlying stream. |
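
The reading methods can be sketched in the same way with an in-memory stream. The example assumes Python 3; ``io.BytesIO`` stands in for a file opened for reading binary data.

```python
import codecs
import io

raw = io.BytesIO(b"caf\xc3\xa9\nsecond\nthird\n")
reader = codecs.getreader("utf-8")(raw, errors="strict")

first_line = reader.readline()                # line ending is kept by default
remaining = reader.readlines(keepends=False)  # line endings stripped

# A fresh reader: read() with no arguments decodes as much as possible.
everything = codecs.getreader("utf-8")(io.BytesIO(b"caf\xc3\xa9")).read()
```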
| |
| The next two base classes are included for convenience. They are not needed by |
| the codec registry, but may prove useful in practice. |
| |
| |
| *encode* and *decode* are needed for the frontend translation, *Reader* and |
| *Writer* for the backend translation. The intermediate format used is |
| determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode |
| as the intermediate encoding. |
| |
| Error handling is done in the same way as defined for the stream readers and |
| writers. |
| |
n | |
| :class:`StreamRecoder` instances define the combined interfaces of |
| :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other |
| methods and attributes from the underlying stream. |
| |
| |
| .. _encodings-overview: |
| |
| Encodings and Unicode |
| --------------------- |
| |
| Unicode strings are stored internally as sequences of codepoints (to be precise |
| as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either |
n | via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with |
n | via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the |
| the former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit |
| former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data |
| data type. Once a Unicode object is used outside of CPU and memory, CPU |
| type. Once a Unicode object is used outside of CPU and memory, CPU endianness |
| endianness and how these arrays are stored as bytes become an issue. |
| and how these arrays are stored as bytes become an issue. Transforming a |
| Transforming a unicode object into a sequence of bytes is called encoding and |
| unicode object into a sequence of bytes is called encoding and recreating the |
| recreating the unicode object from the sequence of bytes is known as decoding. |
| unicode object from the sequence of bytes is known as decoding. There are many |
| There are many different methods for how this transformation can be done (these |
| different methods for how this transformation can be done (these methods are |
| methods are also called encodings). The simplest method is to map the codepoints |
| also called encodings). The simplest method is to map the codepoints 0-255 to |
| 0-255 to the bytes ``0x0``\ -\ ``0xff``. This means that a unicode object that |
| the bytes ``0x0``-``0xff``. This means that a unicode object that contains |
| contains codepoints above ``U+00FF`` can't be encoded with this method (which |
| codepoints above ``U+00FF`` can't be encoded with this method (which is called |
| is called ``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise |
| ``'latin-1'`` or ``'iso-8859-1'``). :func:`unicode.encode` will raise a |
| a :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: |
| :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1' |
| 'latin-1' codec can't encode character u'\u1234' in position 3: ordinal not in |
| codec can't encode character u'\u1234' in position 3: ordinal not in |
| range(256)``. |
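
The limitation of ``'latin-1'`` can be reproduced directly. This sketch assumes Python 3, where the method is ``str.encode`` and the character in the message is shown without the ``u`` prefix.

```python
# Codepoints 0-255 map one-to-one onto byte values.
ok = "\u00e9".encode("latin-1")

# A codepoint above U+00FF cannot be represented.
try:
    "abc\u1234".encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # error.start is the position of the offending character
```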
| |
| There's another group of encodings (the so-called charmap encodings) that choose |
| a different subset of all unicode code points and how these codepoints are |
n | mapped to the bytes ``0x0``\ -\ ``0xff.`` To see how this is done simply open |
n | mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open |
| e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on |
| Windows). There's a string constant with 256 characters that shows you which |
| character is mapped to which byte value. |
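
The difference between two charmap encodings shows up as soon as the same byte is decoded with both. A small sketch, assuming Python 3:

```python
byte = b"\x80"

as_cp1252 = byte.decode("cp1252")    # EURO SIGN in the Windows code page
as_latin1 = byte.decode("latin-1")   # a C1 control character in latin-1

# Encoding goes through the same table in the other direction.
euro_bytes = "\u20ac".encode("cp1252")
```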
| |
| All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints |
| defined in unicode. A simple and straightforward way to store each Unicode |
| code point is to store each codepoint as two consecutive bytes. There are two |
| possibilities: Store the bytes in big endian or in little endian order. These |
| As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in |
| the decoded Unicode string (even if it's the first character) is treated as a |
| ``ZERO WIDTH NO-BREAK SPACE``. |
| |
| Without external information it's impossible to reliably determine which |
| encoding was used for encoding a Unicode string. Each charmap encoding can |
| decode any random byte sequence. However that's not possible with UTF-8, as |
| UTF-8 byte sequences have a structure that doesn't allow arbitrary byte |
n | sequence. To increase the reliability with which a UTF-8 encoding can be |
n | sequences. To increase the reliability with which a UTF-8 encoding can be |
| detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls |
| ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters |
| is written to the file, a UTF-8 encoded BOM (which looks like this as a byte |
| sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable |
| that any charmap encoded file starts with these byte values (which would e.g. |
| map to |
| |
n | LATIN SMALL LETTER I WITH DIAERESIS --- RIGHT-POINTING DOUBLE ANGLE QUOTATION |
n | | LATIN SMALL LETTER I WITH DIAERESIS |
| | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
| MARK --- INVERTED QUESTION MARK |
| | INVERTED QUESTION MARK |
| |
| in iso-8859-1), this increases the probability that a utf-8-sig encoding can be |
| correctly guessed from the byte sequence. So here the BOM is not used to be able |
| to determine the byte order used for generating the byte sequence, but as a |
| signature that helps in guessing the encoding. On encoding the utf-8-sig codec |
| will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On |
| decoding utf-8-sig will skip those three bytes if they appear as the first three |
| bytes in the file. |
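
The behaviour of the signature can be checked with a one-character string. This is a sketch assuming Python 3; ``codecs.BOM_UTF8`` is the three-byte sequence mentioned above.

```python
import codecs

data = "a".encode("utf-8-sig")          # signature followed by the data

with_sig = data.decode("utf-8-sig")     # signature is skipped
plain_utf8 = data.decode("utf-8")       # U+FEFF stays in the result
as_latin1 = data.decode("iso-8859-1")   # the three characters listed above

starts_with_bom = data.startswith(codecs.BOM_UTF8)
```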
| +-----------------+--------------------------------+--------------------------------+ |
| | latin_1 | iso-8859-1, iso8859-1, 8859, | West Europe | |
| | | cp819, latin, latin1, L1 | | |
| +-----------------+--------------------------------+--------------------------------+ |
| | iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe | |
| +-----------------+--------------------------------+--------------------------------+ |
| | iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese | |
| +-----------------+--------------------------------+--------------------------------+ |
n | | iso8859_4 | iso-8859-4, latin4, L4 | Baltic languagues | |
n | | iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages | |
| +-----------------+--------------------------------+--------------------------------+ |
| | iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, | |
| | | | Macedonian, Russian, Serbian | |
| +-----------------+--------------------------------+--------------------------------+ |
| | iso8859_6 | iso-8859-6, arabic | Arabic | |
| +-----------------+--------------------------------+--------------------------------+ |
| | iso8859_7 | iso-8859-7, greek, greek8 | Greek | |
| +-----------------+--------------------------------+--------------------------------+ |
| | shift_jis | csshiftjis, shiftjis, sjis, | Japanese | |
| | | s_jis | | |
| +-----------------+--------------------------------+--------------------------------+ |
| | shift_jis_2004 | shiftjis2004, sjis_2004, | Japanese | |
| | | sjis2004 | | |
| +-----------------+--------------------------------+--------------------------------+ |
| | shift_jisx0213 | shiftjisx0213, sjisx0213, | Japanese | |
| | | s_jisx0213 | | |
n | +-----------------+--------------------------------+--------------------------------+ |
| | utf_32 | U32, utf32 | all languages | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_32_be | UTF-32BE | all languages | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_32_le | UTF-32LE | all languages | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_16 | U16, utf16 | all languages | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_16_be | UTF-16BE | all languages (BMP only) | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_16_le | UTF-16LE | all languages (BMP only) | |
| +-----------------+--------------------------------+--------------------------------+ |
| | utf_7 | U7, unicode-1-1-utf-7 | all languages | |
| | bz2_codec | bz2 | byte string | Compress the operand | |
| | | | | using bz2 | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | hex_codec | hex | byte string | Convert operand to | |
| | | | | hexadecimal | |
| | | | | representation, with two | |
| | | | | digits per byte | |
| +--------------------+---------------------------+----------------+---------------------------+ |
n | | idna | | Unicode string | Implements :rfc:`3490`. | |
n | | idna | | Unicode string | Implements :rfc:`3490`, | |
| | | | | See also | |
| | | | | see also | |
| | | | | :mod:`encodings.idna` | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | mbcs | dbcs | Unicode string | Windows only: Encode | |
| | | | | operand according to the | |
| | | | | ANSI codepage (CP_ACP) | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | palmos | | Unicode string | Encoding of PalmOS 3.5 | |
| +--------------------+---------------------------+----------------+---------------------------+ |
n | | punycode | | Unicode string | Implements :rfc:`3492`. | |
n | | punycode | | Unicode string | Implements :rfc:`3492` | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | quopri_codec | quopri, quoted-printable, | byte string | Convert operand to MIME | |
| | | quotedprintable | | quoted printable | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | raw_unicode_escape | | Unicode string | Produce a string that is | |
| | | | | suitable as raw Unicode | |
| | | | | literal in Python source | |
| | | | | code | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | uu_codec | uu | byte string | Convert the operand using | |
| | | | | uuencode | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| | zlib_codec | zip, zlib | byte string | Compress the operand | |
| | | | | using gzip | |
| +--------------------+---------------------------+----------------+---------------------------+ |
| |
n | .. versionadded:: 2.3 |
| The ``idna`` and ``punycode`` encodings. |
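
A few of the codecs in the table can be exercised as follows. The sketch assumes Python 3, where the byte-string codecs such as ``hex_codec`` and ``zlib_codec`` are reached through :func:`codecs.encode` and :func:`codecs.decode` rather than through string methods.

```python
import codecs

# Byte string codecs: bytes in, bytes out.
hex_form = codecs.encode(b"abc", "hex_codec")
compressed = codecs.encode(b"spam" * 100, "zlib_codec")
restored = codecs.decode(compressed, "zlib_codec")

# Unicode string codecs: text in, bytes out.
puny = "b\u00fccher".encode("punycode")
raw_escape = "\u1234".encode("raw_unicode_escape")
```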
| |
| |
| :mod:`encodings.idna` --- Internationalized Domain Names in Applications |
| ------------------------------------------------------------------------ |
| |
| .. module:: encodings.idna |
| :synopsis: Internationalized Domain Names implementation |
n | |
| |
| .. moduleauthor:: Martin v. Löwis |
| |
| .. versionadded:: 2.3 |
| |
| This module implements :rfc:`3490` (Internationalized Domain Names in |
| Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for |
| Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding |
| and :mod:`stringprep`. |
| |
| These RFCs together define a protocol to support non-ASCII characters in domain |
| names. A domain name containing non-ASCII characters (such as |
n | "www.Alliancefrançaise.nu") is converted into an ASCII-compatible encoding (ACE, |
n | ``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding |
| such as "www.xn--alliancefranaise-npb.nu"). The ACE form of the domain name is |
| (ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain |
| then used in all places where arbitrary characters are not allowed by the |
| name is then used in all places where arbitrary characters are not allowed by |
| protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so on. This |
| the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so |
| conversion is carried out in the application; if possible invisible to the user: |
| on. This conversion is carried out in the application; if possible invisible to |
| The application should transparently convert Unicode domain labels to IDNA on |
| the user: The application should transparently convert Unicode domain labels to |
| the wire, and convert back ACE labels to Unicode before presenting them to the |
| IDNA on the wire, and convert back ACE labels to Unicode before presenting them |
| user. |
| to the user. |
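
The conversion in both directions can be sketched with the ``idna`` codec. The example assumes Python 3; the host name ``bücher.example`` is a made-up name used only for illustration, alongside the domain name from the paragraph above.

```python
# Unicode label to ACE and back (hypothetical host name).
ace = "b\u00fccher.example".encode("idna")
back = ace.decode("idna")
# ace  == b"xn--bcher-kva.example"
# back == "bücher.example"

# The name used in the text above; only labels containing non-ASCII
# characters receive the "xn--" prefix.
doc_ace = "www.Alliancefran\u00e7aise.nu".encode("idna")
```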
| |
| Python supports this conversion in several ways: The ``idna`` codec converts |
| between Unicode and the ACE. Furthermore, the :mod:`socket` module |
| transparently converts Unicode host names to ACE, so that applications need not |
| be concerned about converting host names themselves when they pass them to the |
| socket module. On top of that, modules that have host names as function |
| parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names |
| (:mod:`httplib` then also transparently sends an IDNA hostname in the |