| ============ |
| Unicode data |
| ============ |
| |
| Django natively supports Unicode data everywhere. Providing your database can |
| somehow store the data, you can safely pass around Unicode strings to |
| templates, models and the database. |
| |
| This document tells you what you need to know if you're writing applications |
| that use data or templates that are encoded in something other than ASCII. |
| |
| Creating the database |
| ===================== |
| |
| Make sure your database is configured to be able to store arbitrary string |
| data. Normally, this means giving it an encoding of UTF-8 or UTF-16. If you use |
| a more restrictive encoding -- for example, latin1 (iso8859-1) -- you won't be |
| able to store certain characters in the database, and information will be lost. |
| |
| * MySQL users, refer to the `MySQL manual`_ (section 10.1.3.2 for MySQL 5.1) |
| for details on how to set or alter the database character set encoding. |
| |
| * PostgreSQL users, refer to the `PostgreSQL manual`_ (section 21.2.2 in |
| PostgreSQL 8) for details on creating databases with the correct encoding. |
| |
| * SQLite users, there is nothing you need to do. SQLite always uses UTF-8 |
| for internal encoding. |
| |
| .. _MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/charset-database.html |
| .. _PostgreSQL manual: http://www.postgresql.org/docs/8.2/static/multibyte.html#AEN24104 |
| |
| All of Django's database backends automatically convert Unicode strings into |
| the appropriate encoding for talking to the database. They also automatically |
| convert strings retrieved from the database into Python Unicode strings. You |
| don't even need to tell Django what encoding your database uses: that is |
| handled transparently. |
| |
| For more, see the section "The database API" below. |
| |
| General string handling |
| ======================= |
| |
| Whenever you use strings with Django -- e.g., in database lookups, template |
| rendering or anywhere else -- you have two choices for encoding those strings. |
| You can use Unicode strings, or you can use normal strings (sometimes called |
| "bytestrings") that are encoded using UTF-8. |
| |
| .. versionchanged:: 1.5 |
| |
| In Python 3, the logic is reversed, that is normal strings are Unicode, and |
| when you want to specifically create a bytestring, you have to prefix the |
| string with a 'b'. As we are doing in Django code from version 1.5, |
| we recommend that you import ``unicode_literals`` from the __future__ library |
| in your code. Then, when you specifically want to create a bytestring literal, |
| prefix the string with 'b'. |
| |
| Python 2 legacy:: |
| |
| my_string = "This is a bytestring" |
| my_unicode = u"This is an Unicode string" |
| |
| Python 2 with unicode literals or Python 3:: |
| |
| from __future__ import unicode_literals |
| |
| my_string = b"This is a bytestring" |
| my_unicode = "This is an Unicode string" |
| |
| See also :doc:`Python 3 compatibility </topics/python3>`. |
| |
| .. admonition:: Warning |
| |
| A bytestring does not carry any information with it about its encoding. |
| For that reason, we have to make an assumption, and Django assumes that all |
| bytestrings are in UTF-8. |
| |
| If you pass a string to Django that has been encoded in some other format, |
| things will go wrong in interesting ways. Usually, Django will raise a |
| ``UnicodeDecodeError`` at some point. |
| |
| If your code only uses ASCII data, it's safe to use your normal strings, |
| passing them around at will, because ASCII is a subset of UTF-8. |
| |
| Don't be fooled into thinking that if your :setting:`DEFAULT_CHARSET` setting is set |
| to something other than ``'utf-8'`` you can use that other encoding in your |
| bytestrings! :setting:`DEFAULT_CHARSET` only applies to the strings generated as |
| the result of template rendering (and email). Django will always assume UTF-8 |
| encoding for internal bytestrings. The reason for this is that the |
| :setting:`DEFAULT_CHARSET` setting is not actually under your control (if you are the |
| application developer). It's under the control of the person installing and |
| using your application -- and if that person chooses a different setting, your |
| code must still continue to work. Ergo, it cannot rely on that setting. |
| |
| In most cases when Django is dealing with strings, it will convert them to |
| Unicode strings before doing anything else. So, as a general rule, if you pass |
| in a bytestring, be prepared to receive a Unicode string back in the result. |
| |
| Translated strings |
| ------------------ |
| |
| Aside from Unicode strings and bytestrings, there's a third type of string-like |
| object you may encounter when using Django. The framework's |
| internationalization features introduce the concept of a "lazy translation" -- |
| a string that has been marked as translated but whose actual translation result |
| isn't determined until the object is used in a string. This feature is useful |
| in cases where the translation locale is unknown until the string is used, even |
| though the string might have originally been created when the code was first |
| imported. |
| |
| Normally, you won't have to worry about lazy translations. Just be aware that |
| if you examine an object and it claims to be a |
| ``django.utils.functional.__proxy__`` object, it is a lazy translation. |
| Calling ``unicode()`` with the lazy translation as the argument will generate a |
| Unicode string in the current locale. |
| |
| For more details about lazy translation objects, refer to the |
| :doc:`internationalization </topics/i18n/index>` documentation. |
| |
| Useful utility functions |
| ------------------------ |
| |
| Because some string operations come up again and again, Django ships with a few |
| useful functions that should make working with Unicode and bytestring objects |
| a bit easier. |
| |
| Conversion functions |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| The ``django.utils.encoding`` module contains a few functions that are handy |
| for converting back and forth between Unicode and bytestrings. |
| |
| * ``smart_text(s, encoding='utf-8', strings_only=False, errors='strict')`` |
| converts its input to a Unicode string. The ``encoding`` parameter |
| specifies the input encoding. (For example, Django uses this internally |
| when processing form input data, which might not be UTF-8 encoded.) The |
| ``strings_only`` parameter, if set to True, will result in Python |
| numbers, booleans and ``None`` not being converted to a string (they keep |
| their original types). The ``errors`` parameter takes any of the values |
| that are accepted by Python's ``unicode()`` function for its error |
| handling. |
| |
| If you pass ``smart_text()`` an object that has a ``__unicode__`` |
| method, it will use that method to do the conversion. |
| |
| * ``force_text(s, encoding='utf-8', strings_only=False, |
| errors='strict')`` is identical to ``smart_text()`` in almost all |
| cases. The difference is when the first argument is a :ref:`lazy |
| translation <lazy-translations>` instance. While ``smart_text()`` |
| preserves lazy translations, ``force_text()`` forces those objects to a |
| Unicode string (causing the translation to occur). Normally, you'll want |
| to use ``smart_text()``. However, ``force_text()`` is useful in |
| template tags and filters that absolutely *must* have a string to work |
| with, not just something that can be converted to a string. |
| |
| * ``smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')`` |
| is essentially the opposite of ``smart_text()``. It forces the first |
| argument to a bytestring. The ``strings_only`` parameter has the same |
| behavior as for ``smart_text()`` and ``force_text()``. This is |
| slightly different semantics from Python's builtin ``str()`` function, |
| but the difference is needed in a few places within Django's internals. |
| |
| Normally, you'll only need to use ``smart_text()``. Call it as early as |
| possible on any input data that might be either Unicode or a bytestring, and |
| from then on, you can treat the result as always being Unicode. |
| |
| .. _uri-and-iri-handling: |
| |
| URI and IRI handling |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| Web frameworks have to deal with URLs (which are a type of IRI_). One |
| requirement of URLs is that they are encoded using only ASCII characters. |
| However, in an international environment, you might need to construct a |
| URL from an IRI_ -- very loosely speaking, a URI_ that can contain Unicode |
| characters. Quoting and converting an IRI to URI can be a little tricky, so |
| Django provides some assistance. |
| |
| * The function ``django.utils.encoding.iri_to_uri()`` implements the |
| conversion from IRI to URI as required by the specification (:rfc:`3987`). |
| |
| * The functions ``django.utils.http.urlquote()`` and |
| ``django.utils.http.urlquote_plus()`` are versions of Python's standard |
| ``urllib.quote()`` and ``urllib.quote_plus()`` that work with non-ASCII |
| characters. (The data is converted to UTF-8 prior to encoding.) |
| |
| These two groups of functions have slightly different purposes, and it's |
| important to keep them straight. Normally, you would use ``urlquote()`` on the |
| individual portions of the IRI or URI path so that any reserved characters |
| such as '&' or '%' are correctly encoded. Then, you apply ``iri_to_uri()`` to |
| the full IRI and it converts any non-ASCII characters to the correct encoded |
| values. |
| |
| .. note:: |
| Technically, it isn't correct to say that ``iri_to_uri()`` implements the |
| full algorithm in the IRI specification. It doesn't (yet) perform the |
| international domain name encoding portion of the algorithm. |
| |
| The ``iri_to_uri()`` function will not change ASCII characters that are |
| otherwise permitted in a URL. So, for example, the character '%' is not |
| further encoded when passed to ``iri_to_uri()``. This means you can pass a |
| full URL to this function and it will not mess up the query string or anything |
| like that. |
| |
| An example might clarify things here:: |
| |
| >>> urlquote(u'Paris & Orléans') |
| u'Paris%20%26%20Orl%C3%A9ans' |
| >>> iri_to_uri(u'/favorites/François/%s' % urlquote('Paris & Orléans')) |
| '/favorites/Fran%C3%A7ois/Paris%20%26%20Orl%C3%A9ans' |
| |
| If you look carefully, you can see that the portion that was generated by |
| ``urlquote()`` in the second example was not double-quoted when passed to |
| ``iri_to_uri()``. This is a very important and useful feature. It means that |
| you can construct your IRI without worrying about whether it contains |
| non-ASCII characters and then, right at the end, call ``iri_to_uri()`` on the |
| result. |
| |
| The ``iri_to_uri()`` function is also idempotent, which means the following is |
| always true:: |
| |
| iri_to_uri(iri_to_uri(some_string)) = iri_to_uri(some_string) |
| |
| So you can safely call it multiple times on the same IRI without risking |
| double-quoting problems. |
| |
| .. _URI: http://www.ietf.org/rfc/rfc2396.txt |
| .. _IRI: http://www.ietf.org/rfc/rfc3987.txt |
| |
| Models |
| ====== |
| |
| Because all strings are returned from the database as Unicode strings, model |
| fields that are character based (CharField, TextField, URLField, etc) will |
| contain Unicode values when Django retrieves data from the database. This |
| is *always* the case, even if the data could fit into an ASCII bytestring. |
| |
| You can pass in bytestrings when creating a model or populating a field, and |
| Django will convert it to Unicode when it needs to. |
| |
| Choosing between ``__str__()`` and ``__unicode__()`` |
| ---------------------------------------------------- |
| |
| .. note:: |
| |
| If you are on Python 3, you can skip this section because you'll always |
| create ``__str__()`` rather than ``__unicode__()``. If you'd like |
| compatibility with Python 2, you can decorate your model class with |
| :func:`~django.utils.encoding.python_2_unicode_compatible`. |
| |
| One consequence of using Unicode by default is that you have to take some care |
| when printing data from the model. |
| |
| In particular, rather than giving your model a ``__str__()`` method, we |
| recommended you implement a ``__unicode__()`` method. In the ``__unicode__()`` |
| method, you can quite safely return the values of all your fields without |
| having to worry about whether they fit into a bytestring or not. (The way |
| Python works, the result of ``__str__()`` is *always* a bytestring, even if you |
| accidentally try to return a Unicode object). |
| |
| You can still create a ``__str__()`` method on your models if you want, of |
| course, but you shouldn't need to do this unless you have a good reason. |
| Django's ``Model`` base class automatically provides a ``__str__()`` |
| implementation that calls ``__unicode__()`` and encodes the result into UTF-8. |
| This means you'll normally only need to implement a ``__unicode__()`` method |
| and let Django handle the coercion to a bytestring when required. |
| |
| Taking care in ``get_absolute_url()`` |
| ------------------------------------- |
| |
| URLs can only contain ASCII characters. If you're constructing a URL from |
| pieces of data that might be non-ASCII, be careful to encode the results in a |
| way that is suitable for a URL. The :func:`~django.core.urlresolvers.reverse` |
| function handles this for you automatically. |
| |
| If you're constructing a URL manually (i.e., *not* using the ``reverse()`` |
| function), you'll need to take care of the encoding yourself. In this case, |
| use the ``iri_to_uri()`` and ``urlquote()`` functions that were documented |
| above_. For example:: |
| |
| from django.utils.encoding import iri_to_uri |
| from django.utils.http import urlquote |
| |
| def get_absolute_url(self): |
| url = u'/person/%s/?x=0&y=0' % urlquote(self.location) |
| return iri_to_uri(url) |
| |
| This function returns a correctly encoded URL even if ``self.location`` is |
| something like "Jack visited Paris & Orléans". (In fact, the ``iri_to_uri()`` |
| call isn't strictly necessary in the above example, because all the |
| non-ASCII characters would have been removed in quoting in the first line.) |
| |
| .. _above: `URI and IRI handling`_ |
| |
| The database API |
| ================ |
| |
| You can pass either Unicode strings or UTF-8 bytestrings as arguments to |
| ``filter()`` methods and the like in the database API. The following two |
| querysets are identical:: |
| |
| from __future__ import unicode_literals |
| |
| qs = People.objects.filter(name__contains='Å') |
| qs = People.objects.filter(name__contains=b'\xc3\x85') # UTF-8 encoding of Å |
| |
| Templates |
| ========= |
| |
| You can use either Unicode or bytestrings when creating templates manually:: |
| |
| from __future__ import unicode_literals |
| from django.template import Template |
| t1 = Template(b'This is a bytestring template.') |
| t2 = Template('This is a Unicode template.') |
| |
| But the common case is to read templates from the filesystem, and this creates |
| a slight complication: not all filesystems store their data encoded as UTF-8. |
| If your template files are not stored with a UTF-8 encoding, set the :setting:`FILE_CHARSET` |
| setting to the encoding of the files on disk. When Django reads in a template |
| file, it will convert the data from this encoding to Unicode. (:setting:`FILE_CHARSET` |
| is set to ``'utf-8'`` by default.) |
| |
| The :setting:`DEFAULT_CHARSET` setting controls the encoding of rendered templates. |
| This is set to UTF-8 by default. |
| |
| Template tags and filters |
| ------------------------- |
| |
| A couple of tips to remember when writing your own template tags and filters: |
| |
| * Always return Unicode strings from a template tag's ``render()`` method |
| and from template filters. |
| |
| * Use ``force_text()`` in preference to ``smart_text()`` in these |
| places. Tag rendering and filter calls occur as the template is being |
| rendered, so there is no advantage to postponing the conversion of lazy |
| translation objects into strings. It's easier to work solely with Unicode |
| strings at that point. |
| |
| Email |
| ===== |
| |
| Django's email framework (in ``django.core.mail``) supports Unicode |
| transparently. You can use Unicode data in the message bodies and any headers. |
| However, you're still obligated to respect the requirements of the email |
| specifications, so, for example, email addresses should use only ASCII |
| characters. |
| |
| The following code example demonstrates that everything except email addresses |
| can be non-ASCII:: |
| |
| from __future__ import unicode_literals |
| from django.core.mail import EmailMessage |
| |
| subject = 'My visit to Sør-Trøndelag' |
| sender = 'Arnbjörg Ráðormsdóttir <arnbjorg@example.com>' |
| recipients = ['Fred <fred@example.com'] |
| body = '...' |
| msg = EmailMessage(subject, body, sender, recipients) |
| msg.attach("Une pièce jointe.pdf", "%PDF-1.4.%...", mimetype="application/pdf") |
| msg.send() |
| |
| Form submission |
| =============== |
| |
| HTML form submission is a tricky area. There's no guarantee that the |
| submission will include encoding information, which means the framework might |
| have to guess at the encoding of submitted data. |
| |
| Django adopts a "lazy" approach to decoding form data. The data in an |
| ``HttpRequest`` object is only decoded when you access it. In fact, most of |
| the data is not decoded at all. Only the ``HttpRequest.GET`` and |
| ``HttpRequest.POST`` data structures have any decoding applied to them. Those |
| two fields will return their members as Unicode data. All other attributes and |
| methods of ``HttpRequest`` return data exactly as it was submitted by the |
| client. |
| |
| By default, the :setting:`DEFAULT_CHARSET` setting is used as the assumed encoding |
| for form data. If you need to change this for a particular form, you can set |
| the ``encoding`` attribute on an ``HttpRequest`` instance. For example:: |
| |
| def some_view(request): |
| # We know that the data must be encoded as KOI8-R (for some reason). |
| request.encoding = 'koi8-r' |
| ... |
| |
| You can even change the encoding after having accessed ``request.GET`` or |
| ``request.POST``, and all subsequent accesses will use the new encoding. |
| |
| Most developers won't need to worry about changing form encoding, but this is |
| a useful feature for applications that talk to legacy systems whose encoding |
| you cannot control. |
| |
| Django does not decode the data of file uploads, because that data is normally |
| treated as collections of bytes, rather than strings. Any automatic decoding |
| there would alter the meaning of the stream of bytes. |