Posted on 16 Aug 2022, this text provides information on Bugs & Fixes related to General Tech. Please note that while accuracy is prioritized, the data presented might not be entirely correct or up-to-date. This information is offered for general knowledge and informational purposes only, and should not be considered as a substitute for professional advice.
I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.
In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before
Be very, very aware of this problem that can occur when using utf8_general_ci.
MySQL will not distinguish between some characters in select statements, if the utf8_general_cicollation is used. This can lead to very nasty bugs - especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.
This problem exposes itself at the very least in early 5.x versions - I'm not sure if this behaviour as changed later.
I'm no DBA, but to avoid this problem, I always go with utf8-bin instead of a case-insensitive one.
The script below describes the problem by example.
-- first, create a sandbox to play inCREATEDATABASE`sandbox`;use`sandbox`;-- next, make sure that your client connection is of the same -- character/collate type as the one we're going to test next:
charset utf8 collate utf8_general_ci
-- now, create the table and fill it with valuesCREATETABLE`test`(`key` VARCHAR(16),`value` VARCHAR(16))
CHARACTER SET utf8 COLLATE utf8_general_ci;INSERTINTO`test`VALUES('Key ONE','value'),('Key TWO','valúe');-- (verify)SELECT*FROM`test`;-- now, expose the problem/bug:SELECT*FROM test WHERE`value`='value';---- Note that we get BOTH keys here! MySQLs UTF8 collates that are -- case insensitive (ending with _ci) do not distinguish between -- both values!---- collate 'utf8_bin' doesn't have this problem, as I'll show next:---- first, reset the client connection charset/collate type
charset utf8 collate utf8_bin
-- next, convert the values that we've previously inserted in the tableALTERTABLE`test`CONVERTTO CHARACTER SET utf8 COLLATE utf8_bin;-- now, re-check for the bugSELECT*FROM test WHERE`value`='value';---- Note that we get just one key now, as you'd expect.---- This problem appears to be specific to utf8. Next, I'll try to -- do the same with the 'latin1' charset:---- first, reset the client connection charset/collate type
charset latin1 collate latin1_general_ci
-- next, convert the values that we've previously inserted-- in the tableALTERTABLE`test`CONVERTTO CHARACTER SET latin1 COLLATE latin1_general_ci;-- now, re-check for the bugSELECT*FROM test WHERE`value`='value';---- Again, only one key is returned (expected). This shows -- that the problem with utf8/utf8_generic_ci isn't present -- in latin1/latin1_general_ci---- To complete the example, I'll check with the binary collate-- of latin1 as well:-- first, reset the client connection charset/collate type
charset latin1 collate latin1_bin
-- next, convert the values that we've previously inserted in the tableALTERTABLE`test`CONVERTTO CHARACTER SET latin1 COLLATE latin1_bin;-- now, re-check for the bugSELECT*FROM test WHERE`value`='value';---- Again, only one key is returned (expected).---- Finally, I'll re-introduce the problem in the exact same -- way (for any sceptics out there):-- first, reset the client connection charset/collate type
charset utf8 collate utf8_generic_ci
-- next, convert the values that we've previously inserted in the tableALTERTABLE`test`CONVERTTO CHARACTER SET utf8 COLLATE utf8_general_ci;-- now, re-check for the problem/bugSELECT*FROM
No matter what stage you're at in your education or career, TuteeHUB will help you reach the next level that
you're aiming for. Simply,Choose a subject/topic and get started in self-paced practice
sessions to improve your knowledge and scores.
manpreet
Best Answer
2 years ago
I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.
In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used
utf8_unicode_ci
,utf8_general_ci
, andutf8_bin
before