OK, let's say I have to downgrade, if you will, a file that was pushed to me as valid XML with UTF-8 character encoding to HTML in ISO-8859-1. What follows is the only way I have found to do this that is both easily scriptable and built on common command-line tools in a *nix environment (including OS X!). If there's a better way, or if any of this frankencode can be improved on, please leave a note in the comments :)
The XML in question comes from InDesign CS2, and I've seen to it that the tags used in the document are actually HTML tags, so that part is taken care of.
Some special characters, like fancy quotes and apostrophes, will not translate from UTF-8 to ISO-8859-1, so the next step is to turn them into their ASCII equivalents *before* converting the rest of the file. HTML Tidy seems to be the best tool for the job, using the -b flag to strip the fanciness from those characters.
[After testing this a bit more, I've added the -wrap 0 flag and specified -xml.]
Tidy will also do a bunch more problem solving, like balancing tags and adding proper HTML head, title and body tags (I specified -xml so this doesn't happen). I don't need the XML declaration, so it gets taken out on the fly with sed.
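For what it's worth, a plain `sed '1,1d'` drops the first line no matter what it contains. A slightly more defensive sketch, assuming the declaration (when there is one) sits alone on line 1, deletes that line only if it actually starts with `<?xml`. The function name `strip_xml_decl` is made up for this example.

```shell
# Hypothetical helper: delete line 1 only when it is an XML declaration.
strip_xml_decl() {
    sed '1{/^<?xml/d;}'
}

# Quick check on two lines of sample input.
printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>' '<p>Hello</p>' | strip_xml_decl
# prints: <p>Hello</p>
```

If a file ever arrives without the declaration, its first line of real content survives.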
Once the file's in good shape I can convert it with iconv:
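As a rough illustration of what that call does: `-f` and `-t` name the source and target encodings, and `-c` leaves out any character that has no equivalent in the target, which is why the fancy quotes need dealing with first. The sample strings and the `to_latin1` name are only for this sketch.

```shell
# Hypothetical wrapper around the iconv call used in this post.
to_latin1() {
    iconv -c -f UTF-8 -t iso-8859-1
}

# "café" is 5 bytes in UTF-8 (é is C3 A9) and 4 bytes in ISO-8859-1 (é is E9).
printf 'caf\303\251' | to_latin1 | wc -c

# A curly apostrophe (U+2019, bytes E2 80 99) has no ISO-8859-1 equivalent,
# so -c leaves it out. Some iconv builds still exit non-zero when that
# happens, hence the "|| true".
printf 'it\342\200\231s\n' | to_latin1 || true
```

Without `-c`, iconv would stop at the first character it cannot convert, so a single stray curly quote could cut the output short.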
Two for loops in a bash script and we should have converted files sitting in a 'done' folder:
for f in *.xml; do  # loop header assumed: iterate over the xml files in the current folder
    tidy -q -b -xml -wrap 0 -utf8 "$f" 2>&1 | more | sed '1,1d' > "$f.tidy"
done
for i in *xml.html; do
    # quote the names so files with spaces survive; 'done' must already exist
    /usr/bin/iconv -c -f UTF-8 -t iso-8859-1 "$i" > "done/$i.conv"
done
The `more` pipe is a workaround for suppressing tidy's messages; I found it on Dave Raggett's Tidy page.
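The listing above writes `.tidy` files in the first loop but reads `*xml.html` in the second, so the names need reconciling one way or another. One way to sidestep that, as a sketch only, is to pipe tidy straight into iconv and skip the intermediate files. The assumptions here are mine: inputs match `*.xml`, output goes to `done/<name>.html`, and tidy's messages are sent to `/dev/null` in place of the `2>&1 | more` workaround. The function names are made up.

```shell
#!/bin/sh
# Sketch: one pass per file, no intermediate .tidy files.
# Assumed (not from the listing above): *.xml inputs, done/<name>.html outputs.

convert_one() {
    src=$1
    out="done/$(basename "$src" .xml).html"
    # tidy cleans up the quotes, sed drops the XML declaration on line 1,
    # iconv converts what is left to ISO-8859-1.
    tidy -q -b -xml -wrap 0 -utf8 "$src" 2>/dev/null \
        | sed '1,1d' \
        | iconv -c -f UTF-8 -t iso-8859-1 > "$out"
}

convert_all() {
    mkdir -p done
    for f in *.xml; do
        [ -e "$f" ] || continue   # the glob matched nothing
        convert_one "$f"
    done
}
```

Calling `convert_all` from the folder that holds the xml files should leave the converted copies in `done/`, already under shorter names.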
All that's left to do now is rename the files to something a bit shorter.
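A minimal sketch of that rename, assuming the files in `done/` end in `.xml.html.conv` (which is what the second loop above produces for an input like `report.xml.html`) and that "shorter" means a plain `.html` extension. Both function names are hypothetical.

```shell
# Hypothetical helper: "report.xml.html.conv" becomes "report.html".
short_name() {
    printf '%s\n' "${1%.xml.html.conv}.html"
}

# Rename every converted file in done/ to its shorter name.
shorten_all() {
    for f in done/*.xml.html.conv; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv -- "$f" "$(short_name "$f")"
    done
}
```

Run `shorten_all` from the folder that contains `done/`. The `${1%...}` expansion only trims the suffix, so any folder part of the path is left alone.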