Some Basic Python Questions#1
I'm a total Python noob so please bear with me. I want to have Python scan a page of HTML and replace instances of Microsoft Word entities with something UTF-8 compatible.

My question is, how do you do that in Python (I've Googled this but haven't found a clear answer so far)? I want to dip my toe in the Python waters so I figure something simple like this is a good place to start. It seems that I would need to:


1. load text pasted from MS Word into a variable
2. run some sort of replace function on the contents
3. output it


In PHP I would do it like this:

$test = $_POST['pasted_from_Word']; // for example “Going Mobile”

function defangWord($string)
{
$search = array(
(chr(0xe2) . chr(0x80) . chr(0x98)),
(chr(0xe2) . chr(0x80) . chr(0x99)),
(chr(0xe2) . chr(0x80) . chr(0x9c)),
(chr(0xe2) . chr(0x80) . chr(0x9d)),
(chr(0xe2) . chr(0x80) . chr(0x93)),
(chr(0xe2) . chr(0x80) . chr(0x94)),
(chr(0x2d))
);

$replace = array(
"&lsquo;",
"&rsquo;",
"&ldquo;",
"&rdquo;",
"&ndash;",
"&mdash;",
"&ndash;"
);

return str_replace($search, $replace, $string);
}

echo defangWord($test);


How would you do it in Python?

EDIT: Hmmm, ok ignore my confusion about UTF-8 and entities for the moment. The input contains text pasted from MS Word. Things like curly quotes are showing up as odd symbols. Various PHP functions I used to try and fix it were not giving me the results I wanted. By viewing those odd symbols in a hex editor I saw that they corresponded to the symbols I used above (0xe2, 0x80 etc.). So I simply swapped out the oddball characters with HTML entities. So if the bit I have above already IS UTF-8, what is being pasted in from MS Word that is causing the odd symbols?

EDIT2: So I set out to learn a bit about Python and found I don't really understand encoding. The problem I was trying to solve can be handled simply by having consistent encoding from end to end. If the input form is UTF-8, the database that stores the input is UTF-8 and the page that outputs it is UTF-8... pasting from Word works fine. No special functions needed. Now, about learning a little Python...

posted date: 2009-04-15 17:41:00


Re: Some Basic Python Questions#2
I worked out a solution to this problem. Click to view my topic...

Hope that helps.

posted date: 2009-04-15 17:41:01


Re: Some Basic Python Questions#3
+1: "defangWord()"... I love it! :-)

posted date: 2009-04-15 17:43:00


Re: Some Basic Python Questions#4
The Python code has the same outline. Just replace all of the PHP-isms with Python-isms. Start by creating a file object. The result of a file.read() is a string object. Strings have a "replace" operation.
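
A minimal sketch of the replace step from that outline, written for Python 3 (the thread itself predates it). The name defang_word and the REPLACEMENTS table are hypothetical; the table mirrors the curly quotes and dashes from the question's PHP arrays and assumes the target is HTML named entities.

```python
# Hypothetical Python 3 counterpart of the PHP defangWord() in the question.
# Keys are the characters Word produces; values are assumed HTML entities.
REPLACEMENTS = {
    "\u2018": "&lsquo;",  # left single quote
    "\u2019": "&rsquo;",  # right single quote / apostrophe
    "\u201c": "&ldquo;",  # left double quote
    "\u201d": "&rdquo;",  # right double quote
    "\u2013": "&ndash;",  # en dash
    "\u2014": "&mdash;",  # em dash
}


def defang_word(text):
    """Return text with each mapped character replaced by its entity."""
    # str.replace() handles one search string per call, so loop over the pairs.
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, entity)
    return text


print(defang_word("\u201cGoing Mobile\u201d"))  # prints &ldquo;Going Mobile&rdquo;
```

Unlike PHP's str_replace(), which accepts arrays, str.replace() takes a single pair, which is why the sketch loops over a dictionary.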

posted date: 2009-04-15 17:47:00


Re: Some Basic Python Questions#5
Your best bet for cleaning Word HTML is using HTML Tidy which has a mode just for that. There are a few Python wrappers you can use if you need to do it programmatically.

posted date: 2009-04-15 17:53:00


Re: Some Basic Python Questions#6
As S.Lott said, the Python code would be very, very similar; the only differences would essentially be the function calls/statements.

I don't think Python has a direct equivalent to file_get_contents(), but since you can obtain a list of the lines in the file, you can then join them back together (readlines() keeps each line's trailing newline, so join on an empty string), like this:

    sample = ''.join(open(test, 'r').readlines())

EDIT: Never mind, there's a much easier way:

    sample = file(test).read()

String replacing is almost exactly the same as str_replace(), except that str.replace() takes one search string at a time, so call it once per pair:

    sample = sample.replace(search, replace)

And outputting is as simple as a print statement:

    print defang_word(sample)

So as you can see, the two versions look almost exactly the same.
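
For readers on Python 3, where the file() builtin no longer exists, the read/replace/print steps above might look like the sketch below. The helper name read_text and the throwaway demo file are assumptions made only so the example runs on its own.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole file into one string, decoding with the given encoding."""
    # open() replaces Python 2's file(); the with-block closes the handle.
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Create a small demo file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write("It\u2019s a test\n")
    demo_path = tmp.name

sample = read_text(demo_path)
sample = sample.replace("\u2019", "&rsquo;")  # one search/replace pair
print(sample, end="")

os.remove(demo_path)  # clean up the demo file
```

Passing the encoding explicitly avoids depending on the platform default, which is one common source of the "odd symbols" described in the question.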

posted date: 2009-04-15 17:54:00


Re: Some Basic Python Questions#7
file('foo.txt').read()

posted date: 2009-04-15 18:09:00


Re: Some Basic Python Questions#8
First of all, those aren't Microsoft Word entities; they are UTF-8. You're converting them to HTML entities.

The Pythonic way to write something like:

    chr(0xe2) . chr(0x80) . chr(0x98)

would be:

    '\xe2\x80\x98'

But Python already has built-in functionality for the type of conversion you want to do:

    def defang(string):
        return string.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

This will replace the UTF-8 codes in a string for characters like “ with numeric entities like &#8220;.

If you want to replace those numeric entities with named ones where possible:

    import re
    from htmlentitydefs import codepoint2name

    def convert_match_to_named(match):
        num = int(match.group(1))
        if num in codepoint2name:
            return "&%s;" % codepoint2name[num]
        else:
            return match.group(0)

    def defang_named(string):
        return re.sub(r'&#(\d+);', convert_match_to_named, defang(string))

And use it like so:

    >>> defang_named('\xe2\x80\x9cHello, world!\xe2\x80\x9d')
    '&ldquo;Hello, world!&rdquo;'

To complete the answer, the equivalent code to your example to process a file would look something like this:

    # in Python, it's common to operate a line at a time on a file instead of
    # reading the entire thing into memory
    my_file = open("test100.html")
    for line in my_file:
        print defang_named(line)
    my_file.close()

Note that this answer is targeted at Python 2.5; the Unicode situation is dramatically different for Python 3+.

I also agree with bobince's comment below: if you can just keep the text in UTF-8 format and send it with the correct content-type and charset, do that; if you need it to be in ASCII, then stick with the numeric entities; there's really no need to use the named ones.
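
Since the answer notes that Python 3 differs, here is a rough Python 3 sketch of the same idea. It assumes the input is already a str (decoded text) rather than UTF-8 bytes, and it uses html.entities in place of Python 2's htmlentitydefs; the function names simply follow the answer.

```python
import re
from html.entities import codepoint2name


def defang(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # encode() yields bytes in Python 3, so decode back to str afterwards.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def _to_named(match):
    """Swap a numeric reference for a named entity when one exists."""
    name = codepoint2name.get(int(match.group(1)))
    return "&%s;" % name if name else match.group(0)


def defang_named(text):
    """Like defang(), but prefer named entities such as &ldquo;."""
    return re.sub(r"&#(\d+);", _to_named, defang(text))


print(defang("\u201cHello, world!\u201d"))
print(defang_named("\u201cHello, world!\u201d"))
```

Note that defang_named() also rewrites any numeric references that were already present in the input, which may or may not be what you want.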

posted date: 2009-04-15 18:10:00


Re: Some Basic Python Questions#9
Good call—edited.

posted date: 2009-04-15 18:11:00


Re: Some Basic Python Questions#10
+1 for xmlcharrefreplace — there is no need for HTML named entities today really. But really, leave the UTF-8 alone, smart-quotes intact. As long as you serve it with the correct ‘charset’ header/meta-tag there is no problem.

posted date: 2009-04-15 18:14:00


Re: Some Basic Python Questions#11
+1 for pointing out that the entities are UTF-8 and not some MS weirdness ;-) (and for a well-written answer overall, too)

posted date: 2009-04-15 18:48:00


Re: Some Basic Python Questions#12
I'm confused. The document I am importing in the example is full of strange symbols that correspond to MS Word curly quotes. If I drop them straight into a page with UTF-8 encoding I get strange symbols. If I convert them using my example code they render fine. So, what are they before I convert?
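
One possible explanation, sketched in Python 3: the bytes are valid UTF-8, but somewhere along the way they are being interpreted as a single-byte encoding such as Windows-1252. That is an assumption about this setup, not something the thread confirms; the snippet only shows how such a mismatch produces the familiar odd symbols.

```python
# A right single quote, as Word produces for apostrophes.
text = "\u2019"

# Encoded as UTF-8 it becomes the three bytes seen in the hex editor.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# Reading those same bytes as Windows-1252 yields three unrelated characters.
garbled = utf8_bytes.decode("cp1252")
print(ascii(garbled))  # ascii() keeps the output printable on any console

# Reversing the wrong step recovers the original character.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == text)
```

If this is what is happening, the fix is the one from EDIT2 in the question: make the form, the storage and the output page all agree on UTF-8.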

posted date: 2009-04-15 21:46:00

