OrdaSoft forum
Welcome, Guest
Please Login or Register.    Lost Password?
utf8mb4_unicode_ci Collation - issues (0 viewing) 
Joomla Translate component support
Go to bottom Favoured: 0
TOPIC: utf8mb4_unicode_ci Collation - issues
joomleb (User)
Gold Boarder
Posts: 192
graphgraph
User Offline Click here to see the profile of this user
utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 0  
Hi guys,
PHP 7.3.23 + MySQLi 10.1.40-MariaDB-cll-lve DB + Joomla 3.9.22 + SEF Translate 4.7 Commercial
- on a DB set as latin1_swedish_ci
- with different Tables collation: latin1_swedish_ci / utf8_general_ci / utf8mb4_general_ci / utf8mb4_unicode_ci

First - I change the entire DB collation through phpMyAdmin, following the "Changing the default collation for a database" guide, by selecting:
utf8mb4_unicode_ci + "Change all tables collations" + "Change all tables columns collations"

Second - Admin Tools > Change Database Collation (on the existing tables) > Choose a collation > utf8mb4_unicode_ci > Apply

15 of all the 356 Tables are not moved and are always as utf8_general_ci
As you can see 2 are from SEF Translate.

1 - Won't this create any kind of issues ? Please, How to fix these tables ?
Or at the moment I just have to ignore them and leave them as they are after the transformation ?

Now, I read about it and, as far as I understand (in particular many thanks to Nicholas' help, Akeeba), it is a MySQL limitation, but also a "bad implementation" from you, the technical explanation I found - I have the DB, the Table and the Column Collation / character set format. Often it depends on:

MySQL has a limit of 767 bytes for each index. Many extensions use a VARCHAR(255) column which is also indexed without a character limit.
This essentially tells MySQL to create an index with the maximum amount of data the column can hold. Plain old UTF-8 is up to 3 bytes per character and the column is set to use 255 characters therefore the index size is 3x255 = 765 bytes which is less than the 767 byte limit.

If this column were to be converted to UTF8MB4 each character can take up to 4 bytes (that's what MB4 means: MutiByte, up to 4 bytes per character). Therefore the index size would need to be 4*255 = 1020 bytes. However this is bigger than MySQL's 767 byte limit for indices. Therefore the table cannot be converted to UTF8MB4.

There are two solutions BUT they must be implemented by the developer of the affected extension, you:
- A - to set the column to VARCHAR(191). The index in this case is 4*191 = 764 bytes which is under MySQL's limit. This is perfectly fine for most practical cases. Very few title fields need to go that big, even in languages which use a massive amount of characters per word (e.g. German and Greek)
- B - to keep the columns as VARCHAR(255) but to set a character limit to the index. This is the recommended solution for performance reasons but its applicability depends on the data shape. For example, a developer may choose to create an index on the column but limit it to 64 characters. This will only be a problem if you expect to have multiple rows with their first 64 characters in common. In fact, often, since it is very difficult to determine when creating a mass distributed extension which can be used in contexts, the developer can't even imagine and he prefers to convert the column to VARCHAR(191).

Now, if this is the case, Joomla added UTF8MB4 support nearly three years ago. The real reason for UTF8MB4 support in Joomla is security, because it mitigates a lot of trickery which could lead in SQL injections. Joomla 4 standard is (the old) utf8mb4_unicode_ci.
You should already have addressed it and I'm sure you are on it. It seems really a simple implementation and I'm really surprised that here we are still speaking about "utf8_general_ci", Do you agree ?

2 - Please, What is your plan on it ?
Logged Logged
Last Edit: 2020/11/02 23:14 By joomleb.
The administrator has disabled public write access.
admin (Admin)
Admin
Posts: 7946
graph
User Offline Click here to see the profile of this user
Re:utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 97  
Hi

Yes you right,
But 90% of our clients not know what is it utf8mb4 or utf8
Yes, they no need know this.
More bad when about this not know Hosting provider.

and when our client install our QuickStart packages and see error:
Error in mySQL syntax near at "utf8mb4" - because of hosting not support utf8mb4

Regards
OrdaSoft team
Logged Logged
The administrator has disabled public write access.
joomleb (User)
Gold Boarder
Posts: 192
graphgraph
User Offline Click here to see the profile of this user
Re:utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 0  
Hi guys,
"...because of hosting not support utf8mb4..." = I'm scared, we are speaking about MySQL 5.5.3 released in early 2010, more than 10 years ago, in a binary language: "another era", "when dinosaurs walked the earth"

Anyway, Please:

1 - Can I safely, without causing any SEF Translate issue, process a manually transformation through phpMyAdmin ?
As far as I see there are just 2 tables (3 columns) that need a manual transformation:
_seftranslate_files > url > VARCHAR(255)
_seftranslate_files > file_name > VARCHAR(255)
_seftranslate_urls > url > VARCHAR(255)
Is it better to do it by changing VARCHAR(255) to VARCHAR(191) or by setting a 767 indexes limit ? Please, Can you address me on the right (better) way ?

2 - I want to share with you a constructive discussion I'm having with Nicholas (Akeeba) supporting Admin Tools where we are speaking also about the security reasons, you can read for complete at this link, anyway an important piece:

"...Now, I have two comments to make and please link these third party developers to this reply of mine.

First, I'd like to say that everything I have said here and everything I am about to say is already explained in minute detail in my PR which added backwards compatible UTF8MB4 support to Joomla 3.5.0, submitted 5 ½ years ago on June 2015.

Regarding the practical considerations of UTF8MB4 vs UTF8 support in each extension, it's about which characters can be stored with each collation. When using a utf8_*_ci collation you are essentially only supporting the Unicode Basic Multilingual Plane (BMP). This means that four byte Unicode characters such as Emoji and extended Chinese, Japanese and Korean characters are not supported and instead converted to question marks at best or result in truncated data at worst. Joomla has supported UTF8MB4 for more than five years and people expect these characters to work correctly with it and all of its extensions. There is no good reason why I shouldn't be able to send a mailing that contains an Emoji, make a product description that includes some extended Korean character or have automatic translation fail when using Unicode aliases in Global Configuration for an article whose title uses some less often used traditional Chinese character.

As to how they can convert their extensions, all they have to do is stop supporting Joomla 3.4 and earlier. I would say that's a reasonable thing to do considering that these versions of Joomla are less than 0.4% of all installed Joomla sites. Joomla 3.5 and later allows you to simply write your SQL queries using the utf8mb4_unicode_ci collation and it will magically and automatically convert that to utf8_unicode_ci collation if the database server and/or PHP MySQL driver used on the site do not support utf8mb4. From a third party developer's perspective you DO NOT have to worry about whether the database server supports utf8mb4 or not. That's Joomla's problem. In fact, Joomla's own SQL code does the same – it uses utf8mb4. It still works on old versions of MySQL and PHP which don't support UTF8MB4. This alone should have gotten these developers thinking instead of waving off utf8mb4 support as infeasible and prompt them to take a look at Joomla's code where they would have easily spotted the convertUtf8mb4QueryToUtf8() method in the core Joomla database driver code.

So, it's absolutely feasible, it is necessary and there are no more excuses not to do it. If they really want to support Joomla 3.0 to 3.4 (less than 0.4% of Joomla installations) they can of course create two separate packages the only difference between them being the SQL files. This can be trivially automated, so it's not even a valid excuse. If they still refuse to upgrade to utf8mb4 then the problem is with them and only them...."
Logged Logged
Last Edit: 2020/11/03 16:33 By joomleb.
The administrator has disabled public write access.
admin (Admin)
Admin
Posts: 7946
graph
User Offline Click here to see the profile of this user
Re:utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 97  
Hi

Sorry, Please check, for me most important ROBUST work our software for all site developers.

Nobody can use all soft. You will check first 1-5 - and use the best from them. And only if first 5 products not good - you will select and check next. You not check why product nor work on your PC.
This may be:
- complication of software
- your error
- bad usability
- error of hoster provider
- some 3 party incompatibility.
....

You will check this all only if product which you select have not alternative.
Else you will select other product without errors.
We had errors with utf8mb4 - And I know, to us write only 1 from 100 - other people will prefer use products from other developers without errors.
Sop if we have so errors we try use solutions which work for 99% of our customers.

Bad if you change url - the URL may have very many parameters so may will be very long. Please change index size.
"File field" you may change

Regards
Andrew
Logged Logged
The administrator has disabled public write access.
joomleb (User)
Gold Boarder
Posts: 192
graphgraph
User Offline Click here to see the profile of this user
Re:utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 0  
Hi Andrew,
many thanks for your quick answer. In first of all:
- I want to put you in evidence that I wrote: "...I want to share with you a constructive discussion..." = I'm just reporting, friendly, from a constructive point of view
- I didn't understand all what you wrote me now. I'm going to answer for as far as I understood. If you don't find all the informations, Please, just let me know.

I suppose that the "mess" we have in the collations described in the first post depends either on our mistake or on the installation of an Extension. We're investigating on it and I'll let you know.

So, to solve it I'm asking you:

1 - What should be the right DB / Tables / Columns collations on a new SEF Translate 5.x installation ?
Is utf8mb4_unicode_ci collation the right one ?

2 - (without causing any SEF Translate issue) Can I safely process a manually transformation to utf8mb4_unicode_ci collation through phpMyAdmin ?

3 - Given that at this moment the Tables _seftranslate_files and seftranslate_urls are still empty - No datas are stored there.

4 - As far as I see, just these 2 Tables (3 Columns) need a "manual intervention":
_seftranslate_files > url > VARCHAR(255)
_seftranslate_files > file_name > VARCHAR(255)
_seftranslate_urls > url > VARCHAR(255)
Am I right ?
As far as I understood, you told me that the best way to do it is to change index size, Am I right ? By setting a 767 size, right ?
Please, Can you address me on the right way ?
Logged Logged
The administrator has disabled public write access.
admin (Admin)
Admin
Posts: 7946
graph
User Offline Click here to see the profile of this user
Re:utf8mb4_unicode_ci Collation - issues 2 Years, 3 Months ago
Karma: 97  
Hi

"...I want to share with you a constructive discussion..."
Sorry without me...

FIRST RULE: If all work good no need change nothings, because ..

1 - What should be the right DB / Tables / Columns collations
You use Chine language, so for you maybe this(utf8mb4) may sense

2 - (without causing any SEF Translate issue)
For Now I not know which issues possible after transform to utf8mb4

_seftranslate_files and seftranslate_urls
These tables use for redirect method. If you not use redirect method, SefTranslate at all not use Database.

4 - As far as I see, just these 2 Tables (3 Columns) need a "manual intervention":
If this important for you you may, remove indexes for these tables, do collation change and recreate indexes again

Regards
OrdaSoft team
Logged Logged
The administrator has disabled public write access.
Go to top