cult3

Sorting Multiple Languages and Weird Characters in PHP and MySQL

Oct 19, 2015

Table of contents:

  1. Why is this a problem?
  2. Examples of the problem
  3. How should you sort words?
  4. Using the Collator class
  5. Sorting in the database
  6. Conclusion

In the insulated world of application development, your environment is completely controlled. The general use-cases are known and well understood, and you test your application using seed data to try and cover all expected edge-cases.

But then your application hits the real world and things start to not work as you thought they would! Suddenly your application is used in ways that you would never expect and users start entering data that you didn’t even consider.

One such example of this is sorting data. In the closed world of application development, strings can be sorted very easily using your programming language or database of choice.

However, in the real world, things aren’t always that easy. Users will enter weird data, and it won’t always be english.

If sorting data is critical to your application, you need to find a better way!

In today’s tutorial I’m going to be looking at sorting strings in multiple languages and with weird characters.

Why is this a problem?

For most types of applications, this will probably never be a problem. It only becomes a problem when you accept weird strings of characters or being able to sort the data is very important to the application.

For the most part, English words will generally be sortable without too much trouble. Many types of web application will never encounter this problem. However, words with accents do pose a problem.

If you are building a web application for multiple languages, you will also face the problem of language-specific characters and sorting rules.

And even strings with unexpected characters can totally derail your efforts to sort a list in a “natural” way.

Examples of the problem

In order to fully understand the problems we face, lets take a look at a could of examples.

Imagine we have the following data in a table:

+—-+———+
| id | value |
+—-+———+
| 1 | item 1 |
| 2 | item 2 |
| 3 | item 3 |
| 4 | item 12 |
| 5 | item 22 |
+—-+———+

We need to order the table by the value, so we run this query:

SELECT * FROM `values` ORDER BY `value`;

+—-+———+
| id | value |
+—-+———+
| 1 | item 1 |
| 4 | item 12 |
| 2 | item 2 |
| 5 | item 22 |
| 3 | item 3 |
+—-+———+

Not what we were expecting! Let’s move on to the next example.

Imagine we have the following table of names:

+—-+———-+
| id | value |
+—-+———-+
| 14 | José |
| 15 | Joseph |
| 16 | Josephine |
| 17 | Joséph |
| 18 | joseph |
+—-+———-+

If we run the following case-insensitive query, we get the these results:

SELECT * FROM `values` ORDER BY value ASC;

+—-+———-+
| id | value |
+—-+———-+
| 14 | José |
| 15 | Joseph |
| 17 | Joséph |
| 18 | joseph |
| 16 | Josephine |
+—-+———-+

Alternatively we could run a case-sensitive query, but the results are still not right:

SELECT * FROM `values` ORDER BY BINARY value ASC;

+—-+———-+
| id | value |
+—-+———-+
| 15 | Joseph |
| 16 | Josephine |
| 14 | José |
| 17 | Joséph |
| 18 | joseph |
+—-+———-+

Finally, imagine we were making an international sandwich and we had the following ingredients:

+—-+————+
| id | value |
+—-+————+
| 7 | сыр |
| 8 | pão |
| 9 | salat |
| 10 | jambon |
| 11 | mayonnaise |
+——+——————+
SELECT * FROM `values` ORDER BY `value`;

+—-+————+
| id | value |
+—-+————+
| 10 | jambon |
| 11 | mayonnaise |
| 8 | pão |
| 9 | salat |
| 7 | сыр |
+—-+————+

Once again, although the words are in different languages, we would expect them to be in a slightly different order.

How should you sort words?

So we’ve looked at a couple of examples of sorting data with results that don’t look quite right, but how are you supposed to sort words anyway?

Words should be sorted according to the Unicode Collation Algorithm.

Each locale basically has a set of rules for dealing with sorting. So sorting words in English is going to be different to sorting words in German, or Chinese because accents and case-insensitive strings are treated differently.

Fortunately in PHP we don’t need to implement these rules ourselves as we can simply use the intl extension.

Or more specifically, the Collator class.

Using the Collator class

First create a new instance of Collator with the given locale, in this case I’m using the english rules:

$collator = new Collator("en");

The Collator class has many different sorting rules that can be turned on and off. In this example I want to sort the first array of items so I need to ensure that numbers in strings are sorted correctly:

$collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);

Next we can sort the array:

$collator->sort($items);

And then print the results:

var_dump($items);

/*
array(5) {
[0]=>
string(6) "item 1"
[1]=>
string(6) "item 2"
[2]=>
string(6) "item 3"
[3]=>
string(7) "item 12"
[4]=>
string(7) "item 22"
}
*/

Sorting in the database

Sorting using the Collator class is great as there are a lot of different ways you can specify your sorting rules.

However, it’s not practical to sort big chunks of data in PHP. It’s always better to push that responsibility to the database.

If you want to do the sorting in the database, there are a couple of extra steps.

You will need to either add additional columns for the columns you want to sort on, or a join table if you want to sort by many columns, or in many different languages.

When saving the record, you need to generate the sort key of the value from the Collator class. The returned value is binary data, so we can then pass the value into the bin2hex() function;

$value = "item 22";
$value = bin2hex($collator->getSortKey($value));

var_dump($value);
// string(22) "394f3141041118010a010a"

You can now sort on that column instead of the original one. How you set this up for your application will depend on your application. If you only need to sort one column in one language, you might be able to get away with storing it as an additional property. However a better way might be to create an additional table to store your language specific values.

Conclusion

Sorting data is, for the most part, really easy. I expect 99% of all applications will probably never have to go down this route of specifically adding additional data just to do accurate sorting.

But when sorting is a big part of your application, returning weird results is not going to cut it for your customers.

Fortunately there is a fairly simple solution to this problem. PHP’s International extension makes it really easy to sort data based upon the Unicode Collation Algorithm.

The only difficult bit is figuring out the best implementation for your application.

Philip Brown

@philipbrown

© Yellow Flag Ltd 2024.