Ticket #34400

SFD quoting conventions are nonsensical, and corrupt pickled Python data

Open Date: 2014-09-30 15:58 Last Update: 2014-09-30 16:01

Open [Owner assigned]
5 - Medium
5 - Medium


FontForge SFD files use at least four different quoting conventions for preserving field values that may contain special ASCII characters, newlines, or binary data:

1. UTF7 (used, for instance, for FontLogs);
2. Base85 (used for some binary image data, I think);
3. backslash continuation lines (read by nlgetc(); I don't know what, if anything, writes this); and
4. backslash for escaping literal quotes, backslashes and newlines (written by SFDPickleMe(), read by SFDUnPickle()).

SFDUnPickle() internally calls nlgetc(), so it removes a second layer of backslashes that were never applied in the first place, and that corrupts the data. See https://github.com/fontforge/fontforge/issues/1756#issuecomment-57266522 . Note that the sequence "backslash newline" translates to "empty string" in convention 3 and to "literal newline" in convention 4, so we really must know which of these conventions applies at any given moment in order to read back the same literal data that we wrote.
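To make the double-unquoting concrete, here is a minimal Python sketch of the interaction between conventions 3 and 4. It is a simplified model written for this ticket, not the actual C code: the function names escape_conv4, unescape_conv4 and strip_continuations are made up for illustration, and the real SFDPickleMe(), SFDUnPickle() and nlgetc() may differ in detail.

```python
def escape_conv4(text):
    # Model of the convention 4 writer (assumption): put a backslash in
    # front of every double quote, backslash, and newline.
    out = []
    for ch in text:
        if ch in ('"', '\\', '\n'):
            out.append('\\')
        out.append(ch)
    return ''.join(out)


def unescape_conv4(text):
    # Model of the convention 4 reader: a backslash means "take the next
    # character literally".
    out = []
    i = 0
    while i < len(text):
        if text[i] == '\\' and i + 1 < len(text):
            i += 1  # drop the backslash, keep the character after it
        out.append(text[i])
        i += 1
    return ''.join(out)


def strip_continuations(text):
    # Model of the convention 3 reader: the pair "backslash newline"
    # disappears entirely (a continuation line).
    return text.replace('\\\n', '')


original = 'line one\nline two'
written = escape_conv4(original)

# Reading with convention 4 alone gives back what was written.
clean_read = unescape_conv4(written)

# Reading through the continuation-line layer first, as described above,
# silently deletes the escaped newline.
buggy_read = unescape_conv4(strip_continuations(written))

print(repr(clean_read))  # 'line one\nline two'
print(repr(buggy_read))  # 'line oneline two'
```

The same input text decodes to two different values depending on which layer sees the backslash first, which is the corruption in miniature.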

See also https://github.com/fontforge/fontforge/issues/498 , in which a user willfully ignores the reason UTF7 was specified in the first place.

The backslash-based formats are likely to cause problems because they represent newlines as sequences of characters that still contain newlines, thus defeating one of the main purposes of quoting the newlines at all. Anyone attempting to read past a field encoded that way must decode every character of it to know what is really the next field name and what is just literal text that happens to look like the start of another field. A pure line-based skipping loop will sometimes see what looks like the start of a new field in the middle of a backslash-escaped field value. I think the code inherited from FontForge does sometimes skip over fields without decoding them, for instance when ignoring unwanted undo/redo data; I don't know if there is any code that would attempt to skip over pickled Python data in a way that would be screwed up by this issue, but it certainly looks like trouble in the making.
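A small Python sketch of the skipping problem follows. The field names FieldA and FieldB and both function names are hypothetical, chosen only for illustration; this is not the inherited FontForge code. The first scanner is purely line-based; the second assumes convention 4 applies and treats a line ending in an odd number of backslashes as continuing onto the next line.

```python
def field_names_by_line(sfd_text, known_fields):
    # Naive scan: any line that starts with "Name:" for a known name is
    # taken to be the start of a field.
    found = []
    for line in sfd_text.split('\n'):
        name, sep, _ = line.partition(':')
        if sep and name in known_fields:
            found.append(name)
    return found


def field_names_escape_aware(sfd_text, known_fields):
    # Escape-aware scan, valid only if convention 4 is in force: an odd
    # number of trailing backslashes means the newline itself is escaped,
    # so the next line is still part of the current value.
    found = []
    continued = False
    for line in sfd_text.split('\n'):
        if not continued:
            name, sep, _ = line.partition(':')
            if sep and name in known_fields:
                found.append(name)
        trailing = len(line) - len(line.rstrip('\\'))
        continued = (trailing % 2 == 1)
    return found


# FieldA's value contains an escaped newline followed by text that looks
# like the start of FieldB.
sample = 'FieldA: "first line\\\nFieldB: not a real field"\nFieldB: 42\n'
known = {'FieldA', 'FieldB'}

naive = field_names_by_line(sample, known)
aware = field_names_escape_aware(sample, known)

print(naive)  # ['FieldA', 'FieldB', 'FieldB']
print(aware)  # ['FieldA', 'FieldB']
```

The naive scanner reports a FieldB that does not exist. The escape-aware one gets it right, but only because it was told in advance which convention the value uses.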

FontAnvil ought to use one and only one method for encoding field values, and it ought to produce strings that can be skipped over without decoding every character, i.e., it ought not to ever write literal newline characters except at the actual end of the field value. Base85 seems undesirable in any context where the data is basically text, because it's not human-readable. UTF7 might be good, if we change the implementation to write literal characters wherever possible within the constraints (right now, I think it ends up Base64-ing everything whether necessary or not); it also would be important to make sure that if we ever need to write binary data that is not valid Unicode, the UTF7 implementation can do something sensible with it. Despite the undesirability of adding yet another quoting format, it may be really necessary to create a custom quoting convention in order to handle all the cases we must handle. However, implementing a global normalization of encoding formats would mean writing a file format FontForge cannot read. If we wish to read FontForge files, we must retain the ability to at least decode all the wacky encodings FontForge uses. If we wish to write files FontForge can read, we must also retain the ability to write all those encodings.
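As one possible shape for such a convention (a hypothetical sketch, not a decided FontAnvil format), the Python below keeps printable ASCII literal, writes newline as the two characters backslash and "n", doubles backslashes, and writes every other byte as backslash-x plus two hex digits. The names encode_field and decode_field are invented for this example. The encoded value never contains a literal newline, so a reader can skip it by reading to end of line, and arbitrary bytes that are not valid Unicode survive.

```python
def encode_field(data: bytes) -> str:
    # Encode arbitrary bytes as a single line of printable ASCII.
    out = []
    for b in data:
        if b == 0x5C:              # backslash is doubled
            out.append('\\\\')
        elif b == 0x0A:            # newline becomes backslash + 'n'
            out.append('\\n')
        elif 0x20 <= b <= 0x7E:    # printable ASCII stays readable
            out.append(chr(b))
        else:                      # everything else as \xHH
            out.append('\\x%02X' % b)
    return ''.join(out)


def decode_field(text: str) -> bytes:
    # Inverse of encode_field; raises ValueError on an unknown escape.
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != '\\':
            out.append(ord(ch))
            i += 1
            continue
        kind = text[i + 1:i + 2]
        if kind == '\\':
            out.append(0x5C)
            i += 2
        elif kind == 'n':
            out.append(0x0A)
            i += 2
        elif kind == 'x':
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError('bad escape at offset %d' % i)
    return bytes(out)


sample = b'say "hi"\\\nbye\x00\xff'
encoded = encode_field(sample)
decoded = decode_field(encoded)

print(encoded)
print(decoded == sample)
```

Whether something like this is better than fixing up the UTF7 writer is still an open question; the sketch only shows that the "no literal newlines" property is cheap to get.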

Short term: deal with the fact that our existing code inherited from FontForge screws up pickled Python data. One idea might be to wait and see how FontForge resolves this and then imitate them. FontAnvil doesn't use pickled Python data at all, but at the moment, the inherited code tries to preserve such data through a load/save cycle, and it can't, because the double-unquoting corrupts the data. We could just automatically remove all pickled Python data from input, making best-effort guesses at where the fields end. We also could, whether FontForge attempts this or not, try to preserve pickled Python data properly without corruption.

Longer term: FontAnvil native format apparently needs to diverge from FontForge native format. We need to decide whether we want to continue being able to read and write FontForge native format, and to what extent FontAnvil native format should look like FontForge native format. It's tempting to just bump the version number in the SFD files, and then write them the way they ought to be, but that will cause political problems. One idea might be to call the new format "SFA format," with an appropriate extension, bumped-up internal version numbers, and whatever markup it takes to make sure FontForge will not try to read it, while sacrificing any ability for FontAnvil to read and write FontForge SFD files. But users may be unhappy if FontAnvil cannot at least read FontForge-generated SFD files, and that might harm FontAnvil adoption. Read but not write might be a reasonable policy, but it means continuing maintenance of the painful decoding mess. Maybe just keeping the current mess, while not making it any worse and gently encouraging FontForge to clean up their own act, is actually a better idea.

The immediate bug needs to be fixed somehow. The longer-term policy questions require more thought.

Ticket History (3/3 Histories)

2014-09-30 15:58 Updated by: mskala
  • New Ticket "SFD quoting conventions are nonsensical, and corrupt pickled Python data" created
2014-09-30 15:59 Updated by: mskala
  • Details Updated
2014-09-30 16:01 Updated by: mskala
  • Details Updated


