Bug#553490: wdiff: Does not handle UTF-8 properly (fwd)

October 20th, 2011 - 07:10 am ET by Santiago Vila
Hello.

I received this from the Debian bug system.
I've checked and the current version (1.0.1) still shows the bug.
[ Please keep the Cc: lines when replying, thanks ].

[ Apologies to the submitter for taking so long to process this ]

- Forwarded message -
From: Josh Triplett <josh@joshtriplett.org>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Date: Sat, 31 Oct 2009 11:39:08 -0700
Subject: wdiff: Does not handle UTF-8 properly

Package: wdiff
Version: 0.5-19
Severity: normal

"wdiff -a" uses backspace and overstrike to provide emphasis; thus, it
will emphasize 'x' by printing 'x^Hx'. When it encounters a UTF-8
character, it does this for each byte, rather than for each character;
thus, emphasis of <E2><80><99> (U+2019 RIGHT SINGLE QUOTATION MARK)
looks like '<E2>^H<E2><80>^H<80><99>^H<99>', when it should look
like '<E2><80><99>^H<E2><80><99>'.
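
The difference can be illustrated with a small Python sketch. This is not
wdiff's actual code (wdiff is written in C); the function names are
hypothetical and only model the two behaviours described above, assuming
the input is valid UTF-8.

```python
BS = b"\x08"  # backspace, shown as ^H in the report


def overstrike_per_byte(text: str) -> bytes:
    """Model of the reported behaviour: every byte is overstruck on its own."""
    data = text.encode("utf-8")
    return b"".join(bytes([b]) + BS + bytes([b]) for b in data)


def overstrike_per_char(text: str) -> bytes:
    """Model of the expected behaviour: every code point is overstruck as a unit."""
    return b"".join(
        ch.encode("utf-8") + BS + ch.encode("utf-8") for ch in text
    )


quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, encoded as E2 80 99
print(overstrike_per_byte(quote))  # multi-byte sequence split by backspaces
print(overstrike_per_char(quote))  # complete sequence, backspace, complete sequence
```

For plain ASCII input both functions produce the same bytes, which is why
the problem only shows up with multi-byte characters.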

- Josh Triplett

[...]



To UNSUBSCRIBE, email to debian-bugs-dist-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Replies

#1 Martin von Gagern
October 20th, 2011 - 03:10 pm ET

Dear Santiago, Dear Josh,

I've already noticed that bug in your bug tracker, and added it to the
wdiff bug tracker at Savannah: https://savannah.gnu.org/bugs/?34224

Right now, I'm not sure how best to handle this case. Unicode support is
a big problem for the current wdiff implementation, in many ways. For
example, I guess that the most sensible way to really simulate
overstrike printing would be to detect grapheme clusters, i.e. to treat
even sequences of multiple code points as a single entity if some of the
code points are combining.
http://www.unicode.org/reports/tr29...Boundaries has the
details on this, but I don't think I'll implement this in wdiff myself.
I've been toying with the idea of rewriting wdiff from scratch with
stuff like this in mind, using ICU break iterators or similar. Won't
happen too soon, though.
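
As a rough illustration of the grapheme cluster idea, the following Python
sketch attaches combining code points to the preceding base character
before overstriking. It is only an approximation under stated assumptions:
it uses the standard library's unicodedata.combining() and does not
implement the full segmentation rules from the Unicode report (for
example, it ignores Hangul syllable sequences and zero-width joiner
sequences). A real implementation would more likely use ICU break
iterators, as mentioned above. The function names are hypothetical.

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Simplification: only code points with a non-zero canonical combining
    class are attached; this is not full grapheme cluster segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def overstrike_clusters(text: str) -> str:
    """Emit 'cluster, backspace, cluster' for every approximate cluster."""
    return "".join(c + "\b" + c for c in approx_clusters(text))


# 'e' followed by U+0301 COMBINING ACUTE ACCENT stays together as one unit.
print(approx_clusters("e\u0301x"))
```

Whether a pager then turns such a sequence back into a character
attribute still depends on the pager treating the cluster as occupying a
single column.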

I'm also not sure which versions of less behave in which ways. For
one, I doubt that all of them will know about grapheme clusters when
reading their input, so they might fail to turn it back into character
attributes as expected. I also think that most less implementations
these days will handle terminal control codes just fine, particularly if
called as "less -R". So that overstriking thing might be obsolete in any
case.

Therefore I hope to roll a release soon which will pass terminal control
sequences to less, thus avoiding that overstrike stuff. I'll have to
give a bit more thought to the best combination of configure switches,
environment variables and command line options, though.
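
A minimal sketch of what emphasis via terminal control sequences could
look like, as opposed to overstriking. The choice of the standard SGR
codes for bold and underline is an assumption made for illustration; the
message does not say which sequences wdiff would emit, and the helper
name is hypothetical.

```python
# Standard ECMA-48 / ANSI "Select Graphic Rendition" sequences.
BOLD = "\x1b[1m"
UNDERLINE = "\x1b[4m"
RESET = "\x1b[0m"


def emphasize(text: str, start: str = BOLD) -> str:
    """Wrap text in an SGR start sequence and a reset.

    Unlike overstriking, this does not need to split the text into
    bytes, code points or grapheme clusters at all.
    """
    return f"{start}{text}{RESET}"


# Assumed styling: one style for deleted text, another for inserted text.
print(emphasize("old\u2019s", UNDERLINE), emphasize("new\u2019s", BOLD))
```

Output of this kind is meant to be viewed through something like
"less -R", which passes such sequences on to the terminal.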

Greetings,
Martin von Gagern






