[Question] Unicode Support

Asirt 09-19-2004 04:06 PM
I was wondering if the Paradigm City forums have some sort of Unicode support. If not, is there a possibility that it may be supported in the future? I'm just curious, as other forums I go to have Unicode support, and it may be useful here.
Krang 09-19-2004 11:53 PM
I tried implementing it in the past, but it ended up messing up some older posts, so I changed it back. If I can come up with some kind of converter that will change those old posts to Unicode (or if it really doesn't bother anyone that there will occasionally be a strange character or two in older posts), then I'd be glad to change it.
X Prime 09-20-2004 06:12 AM
quote:
Originally posted by Krang
I tried implementing it in the past, but it ended up messing up some older posts, so I changed it back. If I can come up with some kind of converter that will change those old posts to Unicode (or if it really doesn't bother anyone that there will occasionally be a strange character or two in older posts), then I'd be glad to change it.


Did you try switching the option Remove certain ASCII characters under Censorship and Banning?
Krang 09-21-2004 12:02 AM
quote:
Originally posted by X Prime
Did you try switching the option Remove certain ASCII characters under Censorship and Banning?

I tried that, but unfortunately it has no effect. The main problem is when someone copies a post from Word (or similar programs), and leaves in the special formatting. For example:

“This contains Word’s formatting.”

becomes:

?This contains Word?s formatting.?
Zola 09-21-2004 01:54 AM
quote:
Originally posted by Krang
quote:
Originally posted by X Prime
Did you try switching the option Remove certain ASCII characters under Censorship and Banning?

I tried that, but unfortunately it has no effect. The main problem is when someone copies a post from Word (or similar programs), and leaves in the special formatting. For example:

“This contains Word’s formatting.”

becomes:

?This contains Word?s formatting.?


I have the answer, ROTFL!

I actually figured this out for SBO's board after extensive consultation with the PHP gurus on one of my mailing lists.

It turned out to be really simple. Any form with a text input or text area in which you want to permit those characters simply needs enctype="multipart/form-data" added to the main form tag. It can go anywhere in the tag, just as it doesn't matter where you put the form method or form action, as long as it's in there.

:)
Krang 09-21-2004 02:15 AM
quote:
Originally posted by Zola
I have the answer, ROTFL!

I actually figured this out for SBO's board after extensive consultation with the PHP gurus on one of my mailing lists.

It turned out to be really simple. Any form with a text input or text area in which you want to permit those characters simply needs enctype="multipart/form-data" added to the main form tag. It can go anywhere in the tag, just as it doesn't matter where you put the form method or form action, as long as it's in there.

:)

Thanks for the suggestion, but this problem is actually the opposite of that. The board has no problem accepting Unicode data in its forms when the encoding is changed to UTF-8, but when I change the encoding, all the Word-style quotation marks and apostrophes in older posts are changed to question marks. I guess the best solution would be to find the problem-causing characters and search/replace them in the database...
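
A minimal sketch of the "find the problem-causing characters" step, written in Python rather than the board's PHP. It assumes the old posts are raw windows-1252 bytes and reports every byte in the 0x80-0x9F range, which is where windows-1252 keeps its curly quotes and similar punctuation. The function name is hypothetical and not part of WBB.

```python
from typing import Dict


def find_problem_bytes(raw: bytes) -> Dict[int, str]:
    """Map each byte in the 0x80-0x9F range to its windows-1252 character.

    Assumption: `raw` is a post body stored as windows-1252, not UTF-8.
    """
    found = {}
    for byte in raw:
        if 0x80 <= byte <= 0x9F:
            # A few bytes in this range are undefined in cp1252,
            # so fall back to U+FFFD instead of raising.
            found[byte] = bytes([byte]).decode("cp1252", errors="replace")
    return found


# The example post from this thread, as windows-1252 bytes.
sample = b"\x93This contains Word\x92s formatting.\x94"
problem_chars = find_problem_bytes(sample)
```

Run over each post body pulled from the database, this would give the list of characters a search/replace has to cover. It would report false positives on posts already stored as UTF-8, since the same byte values also occur inside multibyte sequences.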
X Prime 09-21-2004 04:31 AM
quote:
Originally posted by Krang
quote:
Originally posted by Zola
I have the answer, ROTFL!

I actually figured this out for SBO's board after extensive consultation with the PHP gurus on one of my mailing lists.

It turned out to be really simple. Any form with a text input or text area in which you want to permit those characters simply needs enctype="multipart/form-data" added to the main form tag. It can go anywhere in the tag, just as it doesn't matter where you put the form method or form action, as long as it's in there.

:)

Thanks for the suggestion, but this problem is actually the opposite of that. The board has no problem accepting Unicode data in its forms when the encoding is changed to UTF-8, but when I change the encoding, all the Word-style quotation marks and apostrophes in older posts are changed to question marks. I guess the best solution would be to find the problem-causing characters and search/replace them in the database...


Does UTF-8 support HTML 4.01 Entities...? If not, that will probably explain why this happens.
Krang 09-22-2004 02:30 AM
quote:
Originally posted by X Prime
Does UTF-8 support HTML 4.01 Entities...? If not, that will probably explain why this happens.

Actually it does, but by doing some testing I figured out that that wasn't the problem. When the forum is switched to UTF-8 encoding, the symbols are switched back from HTML entities to regular symbols, and apparently most browsers don't like that and display them as question marks instead...

Here's what I did for my test:

I posted this in windows-1252 encoding:
quote:
“This contains Word’s formatting.”

And here was the source code:
code:
“This contains Word’s formatting.”

Then I switched to UTF-8 and it appeared as:
quote:
?This contains Word?s formatting.?

I posted the same post again while in UTF-8, and it looked normal. Here was the source code so far:
code:
“This contains Word’s formatting.”
“This contains Word’s formatting.” (UTF-8 encoding)

Then I switched back to windows-1252, and here was the result:
quote:
“This contains Word’s formatting.”
“This contains Word’s formatting.†(UTF-8 encoding)

So apparently WBB has problems when switching between character sets. I'll have to take a look at the code and how the posts are stored in the database to see if I can figure out the problem...
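
The test above can be reproduced outside the board with a short Python sketch. It only assumes that the stored bytes are handed to the browser unchanged while the declared character set changes; it does not use any WBB code.

```python
# The example post: curly double quotes (U+201C, U+201D) and a curly
# apostrophe (U+2019), written as escapes to keep this file ASCII-only.
original = "\u201cThis contains Word\u2019s formatting.\u201d"

# Case 1: posted under windows-1252, later displayed as UTF-8.
# Each quote is a single byte (0x93, 0x92, 0x94) that is not valid UTF-8
# on its own, so each one turns into the replacement character U+FFFD,
# which browsers commonly draw as a question mark.
stored_1252 = original.encode("cp1252")
seen_as_utf8 = stored_1252.decode("utf-8", errors="replace")

# Case 2: posted under UTF-8, later displayed as windows-1252.
# Each quote is three bytes, and each byte is shown as its own character.
# Byte 0x9D (the end of the closing quote) is undefined in Python's cp1252
# codec, so it is dropped here with errors="ignore".
stored_utf8 = original.encode("utf-8")
seen_as_1252 = stored_utf8.decode("cp1252", errors="ignore")
```

In both cases the bytes in the database are untouched. Only the interpretation changes, which is why switching the board's encoding back makes the older posts look right again.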
X Prime 09-23-2004 02:22 PM
What you are probably looking for is in lib/functions.php

Functions htmlconverter and rehtmlconverter.

Personally, I think it's the HTML entities, no question.

I've read it repeatedly, and am stumped on what the problem could possibly be in this code.
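
For reference, a small Python sketch of how numeric HTML entities behave. It is not WBB's htmlconverter or rehtmlconverter, whose code is not shown in this thread; it only illustrates that a numeric entity is plain ASCII, so it reads the same under windows-1252 and UTF-8, and that the trouble can only start once it is turned back into a raw symbol.

```python
import html

# The example post, with the curly quotes written as escapes.
text = "\u201cThis contains Word\u2019s formatting.\u201d"

# Replace every non-ASCII character with a numeric character reference.
# The result contains only ASCII bytes, so the page encoding is irrelevant.
as_entities = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# Turning the entities back into symbols gives the original text again.
# From this point on, the bytes written out depend on the chosen encoding.
restored = html.unescape(as_entities)
```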
Krang 09-23-2004 11:22 PM
Yeah, I was looking at those functions to see if anything could be done. However, I found out that the problem is with how the symbols are stored in the database. Under windows-1252, each symbol is stored as a single byte, and under UTF-8, it is stored as a multibyte sequence. So I guess there are two solutions: one would be to convert all the existing symbols in the database to Unicode (although this would be a one-way conversion...), and the other would be to get the board to parse the windows-1252 characters first, then the Unicode characters (although this might slow the board down a little...).
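
Both options can be sketched in a few lines of Python, assuming post bodies are available as raw bytes. The function names are hypothetical and nothing here is WBB code. Note that the sketch tries UTF-8 first and falls back to windows-1252, the reverse of the order described above, because almost any byte sequence decodes under windows-1252 without an error and so cannot be used as the first test.

```python
def decode_post(raw: bytes) -> str:
    """Option two: read a post that may be UTF-8 or windows-1252."""
    try:
        # Strict UTF-8 fails on lone windows-1252 bytes such as 0x93.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is an old
        # windows-1252 post. Undefined bytes become U+FFFD.
        return raw.decode("cp1252", errors="replace")


def convert_post_to_utf8(raw: bytes) -> bytes:
    """Option one: one-way conversion of a stored post to UTF-8."""
    return decode_post(raw).encode("utf-8")


old_post = b"\x93Hi\x94"                      # stored under windows-1252
new_post = "\u201cHi\u201d".encode("utf-8")   # stored under UTF-8
```

Because the conversion goes through decode_post, running it a second time over an already converted post leaves it unchanged. A windows-1252 post whose bytes happen to form valid UTF-8 would be misread, but that is unlikely for ordinary text.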
X Prime 09-23-2004 11:27 PM
Well, it comes down to how important UTF-8 is to you. The second choice is overkill in my opinion either way.