Bug#646246: gscan2pdf ocropus and html-entities

October 22nd, 2011 - 06:10 pm ET by noreply
This looks like a UTF-8 problem.

The following patch seems to fix the editing problem
and the conversion problems on PDF export.

Tesseract text still fails when editing and changing the page,
same as before; the cause has not been found yet.
However, the next bug report will bring a real improvement.


--- /usr/share/perl5/Gscan2pdf/Page.pm 2011-08-27 07:00:41.000000000 +0200
+++ /usr/share/perl5/Gscan2pdf/Page.pm 2011-10-22 23:57:19.492261844 +0200
@@ -11,6 +11,7 @@
use HTML::TokeParser;
use HTML::Entities;
use Image::Magick;
+use Encode;
use utf8;

BEGIN {
@@ -135,7 +136,7 @@
}
}
if ( $token->[0] eq 'T' and $token->[1] !~ /^\s*$/ ) {
- $text = HTML::Entities::decode_entities( $token->[1] );
+ $text = HTML::Entities::decode_entities(decode_utf8( $token->[1] ));
chomp($text);
}
if ( $token->[0] eq 'E' ) {
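
A minimal sketch of why the order in the patch can matter, assuming the text token arrives as raw UTF-8 bytes (whether that is the case inside gscan2pdf depends on how the hOCR data is read, which this report does not show). The sample string is hypothetical. Note that `use utf8;` only affects literals in the source file, not data read at run time, so input bytes still need an explicit decode.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Entities qw(decode_entities);

# Hypothetical text token: "Gr", then o-umlaut as two raw UTF-8 bytes
# (0xC3 0xB6), then sharp s written as an HTML entity, then "e".
my $bytes = "Gr\xC3\xB6&szlig;e";

# Entities decoded on the raw bytes: the entity becomes one character
# (U+00DF), but the o-umlaut stays as two separate byte-characters.
my $undecoded = decode_entities($bytes);

# Order used in the patch: bytes to characters first, then entities.
my $decoded = decode_entities( decode_utf8($bytes) );

printf "without decode_utf8: %d characters\n", length $undecoded;   # 6
printf "with decode_utf8:    %d characters\n", length $decoded;     # 5
```

In the first case the string mixes undecoded bytes with a real character, which is the kind of input that tends to show up as mangled text in an editor or in exported PDF text. If the token were already a decoded character string, adding `decode_utf8` would be the wrong fix, so this only illustrates the bytes-in assumption.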




To UNSUBSCRIBE, email to debian-bugs-dist-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Replies

#1 Jeffrey Ratcliffe
October 23rd, 2011 - 03:00 pm ET
Thanks for the patch.

But the bug itself is still not clear to me.

What exactly is happening? Step by step, please.

And what do you expect to happen?

Regards

Jeff



