regex to match any UTF character excluding punctuation#1
I'm preparing a function in PHP to automatically convert a string to be used as a filename in a URL (*.html). Although ASCII should be used to be on the safe side, for SEO reasons I need to allow the filename to be in any language, but I don't want it to include punctuation other than a dash (-) and underscore (_); chars like *%$#@"' shouldn't be allowed.

Spaces should be converted to dashes.

I think that using a regex will be the easiest way, but I'm not sure how to handle UTF-8 strings.

My ASCII function looks like this:

function convertToPath($string)
{
    $string = strtolower(trim($string));
    $string = preg_replace('/[^a-z0-9-]/', '-', $string);
    $string = preg_replace('/-+/', "-", $string);
    return $string;
}


Thanks,

Roy.

posted date: 2009-04-12 01:21:00


Re: regex to match any UTF character excluding punctuation#3
If UTF-8 mode is selected (the u pattern modifier), you can match all non-letters (according to the Unicode general category; please refer to "Regular Expression Details" in the PHP documentation) by using

/\P{L}+/u

so I'd try the following (untested):

function convertToPath($string)
{
    $string = mb_strtolower(trim($string), 'UTF-8');
    $string = preg_replace('/\P{L}+/u', '-', $string);
    $string = preg_replace('/-+/', "-", $string);
    return $string;
}

Be aware that you'll get problems with strtolower() on UTF-8 strings, as it'll mess with your multi-byte characters; use mb_strtolower() instead.
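
Note that \P{L} on its own also replaces digits and underscores, since neither is a letter. One way to get closer to the original requirement (letters and digits from any language, plus dash and underscore) is to negate a character class built from several Unicode properties. This is only a minimal sketch, assuming valid UTF-8 input and a loaded mbstring extension; the function name convertToUnicodePath is made up for illustration and is not from the thread:

```php
<?php
// Hypothetical helper, not from the thread: keeps letters (\p{L}), combining
// marks (\p{M}), digits (\p{N}), dash and underscore. Everything else,
// including spaces and punctuation, becomes a dash.
function convertToUnicodePath(string $string): string
{
    $string = mb_strtolower(trim($string), 'UTF-8');

    // The "u" modifier puts PCRE in UTF-8 mode so \p{...} sees whole characters.
    $string = preg_replace('/[^\p{L}\p{M}\p{N}_-]+/u', '-', $string);
    if ($string === null) {
        // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
        return '';
    }

    // Collapse runs of dashes, then strip dashes from both ends.
    $string = preg_replace('/-+/', '-', $string);
    return trim($string, '-');
}
```

With this sketch, convertToUnicodePath('Hello, World!') gives 'hello-world', and non-Latin input such as 'שלום עולם' keeps its letters and only has the space turned into a dash. Stripping leading and trailing dashes is an extra design choice that the functions above do not make.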

posted date: 2009-04-12 01:35:00


Re: regex to match any UTF character excluding punctuation#4
I think that for SEO needs you should stick to ASCII characters in the URL.

In theory, many more characters are allowed in URLs. In practice, most systems only parse ASCII reliably.

Also, many automagically-parse-the-link scripts choke on non-ASCII characters. So allowing non-ASCII characters in your URLs drastically reduces the chance of your link showing up (correctly) in user-generated content. (If you want an example of such a script, take a look at the Stack Overflow script; it chokes on parentheses, for example.)

You could also take a look at: How to handle diacritics (accents) when rewriting 'pretty URLs'

The accepted solution there is to transliterate the non-ASCII characters:

<?php $text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text); ?>

Hope this helps
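
Combined with the ASCII cleanup from the question, the transliteration idea might look roughly like the sketch below. The function name convertToAsciiPath and the 'en_US.UTF-8' locale are assumptions for illustration; what iconv produces for a given accented character depends on the platform's iconv implementation and the active locale, so treat the result as best-effort:

```php
<?php
// Hypothetical helper, not from the thread: transliterates to ASCII first,
// then applies an ASCII-only cleanup like the one in the original question.
function convertToAsciiPath(string $string): string
{
    // //TRANSLIT output depends on the platform and the LC_CTYPE locale.
    // 'en_US.UTF-8' is an assumption and may not be installed; setlocale()
    // then just returns false. Note that it changes the locale process-wide.
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    // Some iconv implementations raise a notice and return false here.
    $ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $string);
    if ($ascii === false) {
        // Fall back to the raw input; the regex below replaces non-ASCII bytes.
        $ascii = $string;
    }

    $ascii = strtolower(trim($ascii));
    // Anything other than a-z, 0-9, underscore or dash becomes a dash.
    $ascii = preg_replace('/[^a-z0-9_-]+/', '-', $ascii);
    return trim($ascii, '-');
}

echo convertToAsciiPath('Hello World'), "\n"; // hello-world
```

Whatever iconv does with the accented characters, the final regex guarantees that the result only contains a-z, 0-9, underscore and dash.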

posted date: 2009-04-12 01:36:00


Re: regex to match any UTF character excluding punctuation#5
You're right on this one: leaving non-ASCII characters in a URL will cause problems, as you have to track the URL encoding of the client's browser (which is not very consistent). But please note that iconv transliteration requires the correct locale to be set (UTF-8 encoding); on Windows this is a show-stopper.

posted date: 2009-04-12 01:48:00


Re: regex to match any UTF character excluding punctuation#7
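This will also replace things like combining accents (non-spacing marks, which are not letters) with a '-'. So if 'Aït Ben Haddou' is stored in decomposed form (an 'i' followed by a combining diaeresis), the 'Aït' part will become 'Ai-t'.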
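
A small sketch of that effect, and of one possible way around it (also allowing the mark category \p{M}). The variable names are made up for illustration, and the "\u{...}" escape assumes PHP 7.0 or later:

```php
<?php
// "Aït" in decomposed form: "i" followed by U+0308 COMBINING DIAERESIS.
$decomposed = "Ai\u{0308}t Ben Haddou";

// Letters only: the combining mark has category Mn, not L, so it is replaced.
$lettersOnly = preg_replace('/\P{L}+/u', '-', $decomposed);

// Letters plus marks: the accent stays attached to its base letter.
$lettersAndMarks = preg_replace('/[^\p{L}\p{M}]+/u', '-', $decomposed);

echo $lettersOnly, "\n";      // Ai-t-Ben-Haddou
echo $lettersAndMarks, "\n";  // accent kept, spaces replaced by dashes
```

If the intl extension is available, normalizing the input to NFC first with Normalizer::normalize() is another option, since it composes such sequences wherever a precomposed character exists.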

posted date: 2009-04-12 01:59:00

