Monday, November 9, 2009

What's under the hood: Supercompact RTFs

Have you ever wondered what those "supercompact RTFs" are that Atlantis enables by default in its "Tools | Options..." window?



Are there two flavors of the RTF format?

Not really. There is a single Rich Text Format standard from Microsoft, but RTF files can be generated in different ways.

To begin with, any RTF document generated by Atlantis is a valid RTF file that complies with the RTF specification. But this specification allows some document elements to be saved in several different, equally valid ways. Atlantis simply exploits these possibilities to generate smaller RTF files.

You might wonder why Atlantis offers to save "non-supercompact" RTFs at all if "supercompact" RTFs are always smaller than their "traditional" counterparts. The answer is very simple: not every word processing application supports all the features of the RTF specification. Consequently, not every application can display supercompact RTFs correctly. Luckily, this mainly applies to older word processors. All major modern word processors (including MS Word and OpenOffice) display the supercompact RTFs of Atlantis as intended. So you need to uncheck the "Save supercompact RTF documents" option in Atlantis only if you are going to send your RTF documents to someone whose word processor cannot display supercompact RTFs correctly.

OK, so what exactly is the difference between "supercompact RTFs" and "non-supercompact RTFs"? If you need technical details, here they are:

1) A paragraph in traditional RTFs is normally terminated by a sequence of 6 characters (bytes):

\par<carriage_return><line_feed>


"<carriage_return><line_feed>" (or CR/LF - two bytes with codes 13 and 10) is a standard combination terminating any line in most plain text files in Windows.

A paragraph in supercompact RTFs is terminated by 2 characters (bytes) only:

\<carriage_return>


You save 4 bytes per paragraph if you save the document as supercompact.
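
To make the arithmetic concrete, here is a minimal Python sketch that builds both terminators as raw bytes and counts the difference. The constant and function names are illustrative only and are not taken from Atlantis.

```python
# Paragraph terminators as raw bytes, following the description above.
TRADITIONAL_PAR = b"\\par\r\n"   # backslash, "par", CR, LF -> 6 bytes
SUPERCOMPACT_PAR = b"\\\r"       # backslash, CR            -> 2 bytes


def paragraph_savings(paragraph_count: int) -> int:
    """Bytes saved on paragraph terminators alone (hypothetical helper)."""
    per_paragraph = len(TRADITIONAL_PAR) - len(SUPERCOMPACT_PAR)
    return paragraph_count * per_paragraph


print(len(TRADITIONAL_PAR), len(SUPERCOMPACT_PAR))  # 6 2
print(paragraph_savings(1000))                      # 4000
```

A document with 1000 paragraphs would therefore be about 4 KB smaller from this change alone.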

2) Most non-English characters in traditional RTFs are stored in a 4-byte format:

\'<NN>


<NN> is the character's one-byte code written as two hexadecimal digits (each digit takes one byte in the file).

In supercompact RTFs, such characters are normally saved as a single byte containing the ANSI code of the character.

For example, the "é" letter (e-acute) is saved to traditional RTFs as

\'e9


This "é" letter is stored within supercompact RTFs as a single byte with code 233.

So you save 3 bytes per non-English character if you save as supercompact.
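
The two encodings can be compared with a short Python sketch. It assumes the document uses the Windows-1252 ("ANSI" Western European) code page, which the post does not state explicitly, and the helper names are hypothetical.

```python
def traditional_escape(ch: str, codepage: str = "cp1252") -> bytes:
    """Encode one character as the 4-byte RTF escape \\'NN."""
    (code,) = ch.encode(codepage)  # fails unless the character is one ANSI byte
    return b"\\'" + format(code, "02x").encode("ascii")


def supercompact_byte(ch: str, codepage: str = "cp1252") -> bytes:
    """Encode one character as its raw single ANSI byte."""
    return ch.encode(codepage)


e_acute = "\u00e9"  # the "é" letter
print(traditional_escape(e_acute))         # b"\\'e9"
print(supercompact_byte(e_acute)[0])       # 233
print(len(traditional_escape(e_acute)) - len(supercompact_byte(e_acute)))  # 3
```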

3) Finally, the difference that makes supercompact files a really useful option.

Pictures are always stored in traditional RTFs as Windows metafiles.

In supercompact RTFs, pictures are stored as PNGs or JPEGs.

In most cases, metafiles are much bigger than their counterparts in the PNG or JPEG format. Did I say "much bigger"? I actually meant "MUCH-MUCH bigger". A tenfold difference in size between the metafile and PNG versions of the same picture is completely normal.

So if your RTF document contains pictures, you can DRAMATICALLY shrink its file size by saving as supercompact RTF.
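
For the curious, the RTF specification marks the picture type with a keyword inside a \pict group: \wmetafileN for Windows metafiles, \pngblip for PNG and \jpegblip for JPEG. The Python sketch below wraps image bytes in a bare-bones \pict group with hex-encoded data. It is only an illustration: it omits the size keywords a real writer would add, the function name is made up, and the exact output of Atlantis is not shown in this post.

```python
import binascii

# Picture-type keywords defined by the RTF specification.
PICT_KEYWORDS = {"png": b"\\pngblip", "jpeg": b"\\jpegblip"}


def pict_group(image_bytes: bytes, kind: str) -> bytes:
    """Wrap raw image bytes in a minimal RTF \\pict group (hypothetical helper)."""
    hex_data = binascii.hexlify(image_bytes)  # two hex characters per image byte
    return b"{\\pict" + PICT_KEYWORDS[kind] + b" " + hex_data + b"}"


# The first four bytes of any PNG file, used here as stand-in picture data.
print(pict_group(b"\x89PNG", "png"))  # b'{\\pict\\pngblip 89504e47}'
```

Because the picture data is written the same way in both cases, the size of the embedded picture tracks the size of the underlying image format, which is why switching from metafiles to PNG or JPEG pays off so much.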

2 comments:

ralphg said...

CR is short for "carriage return," not "caret return."

The term comes from the days of the typewriter, where the typist pushed the carriage return lever to return the carriage (the roller holding the paper) back to the left margin location -- to start the next line of text.

Electric typewriters automatically performed the carriage return when the carriage reached the right margin. In this case, the typist manually did a second CR to start a new paragraph.

Atlantis Word Processor Team said...

Of course, you're right. I use the "caret" word so often, it just pops out. It's on the top of my mind. Will be corrected. Thanks.