What do people consider 'forever' file formats? I.e., file types that are open source, won't become obsolete quickly, and can run on minimalist computers. I know that lots of people consider .txt files to be pretty good for everything plain text, but what about ODT and ODF for word processing and spreadsheets? Is Ogg, FLAC or AAC better than MP3 for music and audiobooks? GIF for images?

I'm a sci-fi writer and want to know what 'old tech' might still be useful in the future.

Thanks Mastodon!

Great ideas all! I'm writing a near-future (Climate Apocalypse) scene where a group of techies are trying to get a council of Grandmothers (whose job it is to look to the Seventh Generation - 250 years), to sign off on the neo-tribes building their own wrist-decks with the specs of a Commodore-64, 8-bit type machine. (I've heard of people building 8-bit chips in their DIY garage labs IRL) There is high tech in-world, but not likely to last past another supply chain break, post peak oil.

re: file formats 

@ewankeep a lot of modern formats are what we call container formats, where the underlying data can be encoded differently with the container being the same. the various mpeg formats are all containers, it's just that we've gotten into the habit of using v3 for audio-only, v4 for video; v4 audio is usually labelled m4a instead of MP4 since MP4 = video has become so ubiquitous.

ogg is the open-source equivalent and the container has not changed despite huge improvements in encoding quality. it can do audio and video.

PDF, for entirely legal reasons, has been guaranteed to be an open format by Adobe, since the US government made a deal with them to release their documents as PDF. so, there's a good chance PDF won't go away as a document layout format (i.e. display & printing) even if it's not an actual data format (editing)

OpenDocument I believe isn't as stable of a format but I don't actually think it's changed much recently. I'm not sure it'll stay around.

PNG is very likely to stay around although JPEG-XL is a very interesting new development that converts JPEG into a proper container with both lossy and lossless encoding. so, JPEG may effectively become the still image equivalent of MPEG

@clarfonthey @ewankeep text and wordprocessor formats depend very much on what language culture you come from. 8-bit characters are barely adequate for encoding English. Languages with more complicated scripts are left out. There are some countries that can't agree on the script encoding to use: see Myanmar and the ongoing fight between international standards and local convention

@ewankeep The thing to consider with everything other than text (I like markdown better than just plain text, fwiw) is resolution and compression. The less of either, the more archival it is. Then also widespreadness of readers. flac is better than any lossy format, but mp3 players are ubiquitous.

@ewankeep Also consider DRM and paywalled content as a barrier for future folks doing research in old formats as well. Though as a practical matter, unless it's a plot point, I wouldn't specify the file type at all other than by "a document file" or "an image", etc.

@StevenSaus DRM is a good way to make sure your tech doesn't survive!

@ewankeep Don't worry about it. The file format you choose is highly likely to outlast whatever media you save it onto anyways.

@ewankeep .flac and .mp3 both seem reasonably futureproof for music, depending on whether you have space for lossless compression or not. .wav seems pretty futureproof for lossless audio data (used by musicians/producers while tweaking a song’s final audio). As for the filetypes of “song projects” you’d see a musician or producer working with in a DAW program, they are an unending sea of conflicting standards that are continually becoming obsolete.

File formats for the far future 

@ewankeep I think FLAC could have a very long tail because it fills a niche most audio formats don't, and by being lossless it (largely) ensures that there isn't loss with subsequent generations of the files. So it's both more likely to stick around compared to other audio formats, and more useful in that long-term sticking around (which thus may further incline people to keep it alive as a format).

Perhaps an argument could be made for raw PCM though since that'd be easier for future generations to work out decoding thereof from the mere bits, assuming the maximum situation of "we somehow have the raw data of this file, with zero accompanying information or context". It's also a method that was somewhat independently invented a few times IIRC, and in use since some of the absolute earliest digital audio encoding, showing both the (relative) simplicity and the low level of technology potentially required for reproduction. (I mean hell, it was even within our lifetimes that there were digital devices out there that didn't have the horsepower to decode MP3s but could decode WAVs at the same sample rates.)
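
To make the "zero accompanying information" scenario concrete, here is a rough Python sketch of how someone might guess the byte layout of an unlabelled PCM blob. The heuristic (real sound changes smoothly from one sample to the next, so the right interpretation should look the least jumpy) and the test tone are my own assumptions for illustration, not an established archival method.

```python
import math
import struct


def make_test_tone(freq_hz=440, rate_hz=8000, n_samples=2000, amplitude=12000):
    """Stand-in for a found file: a sine tone stored as raw 16-bit little-endian PCM."""
    samples = [
        round(amplitude * math.sin(2 * math.pi * freq_hz * i / rate_hz))
        for i in range(n_samples)
    ]
    return struct.pack(f"<{n_samples}h", *samples)


def roughness(samples, full_scale):
    """Mean jump between neighbouring samples, as a fraction of the full value range."""
    jumps = [abs(b - a) for a, b in zip(samples, samples[1:])]
    return sum(jumps) / len(jumps) / full_scale


def guess_layout(raw):
    """Try a few plausible layouts and return the smoothest one plus all scores."""
    n16 = len(raw) // 2
    even = raw[: n16 * 2]
    candidates = {
        "8-bit unsigned": (list(raw), 256),
        "16-bit little-endian signed": (struct.unpack(f"<{n16}h", even), 65536),
        "16-bit big-endian signed": (struct.unpack(f">{n16}h", even), 65536),
    }
    scores = {name: roughness(s, scale) for name, (s, scale) in candidates.items()}
    return min(scores, key=scores.get), scores


best, scores = guess_layout(make_test_tone())
print(best)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A wrong guess about byte order or sample width turns the signal into something close to noise, which is what makes the right answer stand out. Real recordings are messier than a pure tone, so treat this as a starting point only.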

For images I'd be inclined to somewhat similarly imagine simpler formats to reverse-engineer sans context would be the best bet, although I'm unfamiliar with the nitty gritty of most of 'em. I'd actually imagine SVG files, at least uncompressed ones, would be pretty long-lasting, since it's potentially just plaintext describing geometric math.

@m0xee @ewankeep Oh I mean even if future humans still just have the exact same kinda ears as us on average but just lost all civilization somehow or other, we might not land on quite the same kinda default "good enough" sample rate. But that's the kind of thing where movie science meets real science—by which I mean, having fun with turning big dials until dramatically things start to auditorily cohere ;)

@ewankeep I feel like "booting up old system images in a virtual machine to read an obsolete file format" would be an extremely believable plot point.
Or you could even write a whole book about trying to hunt down the parts to repair an obsolete machine that nobody knows how to emulate, just to look at some old pictures!
It'd be an archaeology adventure!

@ewankeep
* Use Emacs org mode files!
:PROPERTIES:
:utility: questionable
:length: excessive
:END:
- org (from "organizer" i believe) format is plain text with all the structure fully human readable and fairly unobtrusive even outside emacs or another tool that knows how to parse the format richly
(this post is also an example of the format)
(it's also an example of how if you don't have a rich parser you can still understand it with your human eyes)
- emacs dates from the 1970s, is still in use now, and will certainly still be in use in whatever future century you're writing in
- emacs doesn't need a meaningful amount of computing resource by modern standards. it's too big to run on a game boy but works beautifully on a raspberry pi or a 4th generation ipod nano, and you can leave out parts you don't need to make it even smaller
- emacs can emit many other formats from org source (word processor files, PDF and other page description formats for print, images and audio if you're clever, programs in other languages, databases, websites, ...) and it's easy to write new converters
- future retrocomputing is a great sf vibe especially when computing isn't a relative monoculture any more
- emacs is an enormously capable and customizable tool, you can use it as most of an OS if you want and with a little help it can talk to just about anything in the world (hence good for situations where your characters need to interoperate with relatively unalike systems)
- you will make a few nerds like me very happy

@alexis Oh, interesting idea. I'd been thinking of something like GeckOS (for Commodore 64 on unix-like) vibe, but yeah, if it runs emacs that's a whole OS in itself. At least that's my impression of emacs when I played around with it a little bit (as a writer not a coder myself.)

@ewankeep the sqlite file format¹ is likely to be one of these. Explicitly being friendly to reverse engineers is one of its design goals², and the lead developer is strongly motivated by a desire for the format to (long) outlive him.

(The sqlite codebase is one of the most widely deployed programs on the planet right now³ so he'll probably succeed!)

¹: https://www.sqlite.org/fileformat.html

²: https://www.sqlite.org/about.html

³: https://www.sqlite.org/mostdeployed.html
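
As a small illustration of how approachable the format is, the sketch below creates a throwaway database with Python's built-in `sqlite3` module and then reads the start of the file by hand. The file name, table, and helper function are made up for the example; the 16-byte magic string and the big-endian page size at byte offset 16 are described in the file format document linked above.

```python
import os
import sqlite3
import struct
import tempfile


def describe_sqlite_header(path):
    """Read the fixed 100-byte header and pull out two easy-to-spot fields."""
    with open(path, "rb") as f:
        header = f.read(100)
    magic = header[:16]
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:  # the format uses the value 1 to mean 65536
        page_size = 65536
    return magic, page_size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.db")  # hypothetical file name
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('hello, seventh generation')")
    conn.commit()
    conn.close()
    magic, page_size = describe_sqlite_header(path)

print(magic)
print(page_size)
```

A future reader who opens the file in a hex viewer sees a plain ASCII label right at the start, which is a friendly first clue.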

@gnomon Good to know about databases, like this is how everything is built behind the scenes.

@ewankeep Lots of good answers already, I just want to add a remark about file compression. Lots of file formats include compression as part of the spec (PNG, GIF, PDF, ZIP, every video format). Good compression increases the entropy per bit to an ideal maximum, which makes the file look more and more like random noise. If, in your story, the knowledge of the specific compression algorithm is lost, it will make these formats near impossible to reverse engineer.
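
A quick way to see the "looks like random noise" effect is to measure the byte-level entropy of some text before and after compression. This is only a sketch: the word list and the pseudo-random sample text are invented for the demonstration, and zlib stands in for whatever compression a given format actually uses.

```python
import math
import random
import zlib
from collections import Counter


def bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram; 8.0 would be indistinguishable from noise."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented vocabulary, lowercase ASCII only, shuffled into a long "document".
WORDS = [
    "flac", "wav", "text", "plain", "archive", "format", "grandmother", "council",
    "deck", "wrist", "tribe", "future", "storage", "signal", "decode", "lossless",
    "image", "vector", "page", "table", "index", "byte", "header", "sample",
    "rate", "codec", "paper", "stone", "memory", "reader", "writer", "legacy",
]
rng = random.Random(7)
plain = " ".join(rng.choice(WORDS) for _ in range(8000)).encode("ascii")
packed = zlib.compress(plain, 9)

print(f"plain:  {len(plain)} bytes, {bits_per_byte(plain):.2f} bits/byte")
print(f"packed: {len(packed)} bytes, {bits_per_byte(packed):.2f} bits/byte")
```

The plain text has obvious statistical structure (a handful of letters, lots of spaces) that a codebreaker could latch onto; the compressed bytes have far less.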

@ewankeep The most forever file format I know of is shapefile: introduced in 1998, it pretty much uses no technology that couldn't be written on a CP/M machine from the late 1970s. Its core database is a dBaseIII dbf file. It is very crude, and there have been many attempts to replace it, but it's still the one always-works geodata interchange format

https://www.loc.gov/preservation/digital/formats/fdd/fdd000280.shtml

... and that LOC preservation site might hint at a few more

@ewankeep You got plenty of good answers! I just wanted to share the Recommended Formats list from the US Library of Congress, which is chosen to be friendly to archivists (among other goals).
loc.gov/preservation/resources

@ewankeep plaintext?

and sometimes institutional inertia and cultural weight can drag something out way past its due (gif)

@ewankeep if I had to bet on something being around in a couple of centuries, METS/ALTO is a pair of XML standards which are widely used by libraries and digital archives, so there are sociological reasons to expect them to survive

@ewankeep Plain text is probably the only real forever format, and it would have to be UTF-8 encoded at that.

Everything else is subject to being eclipsed by a newer format in 10-20 years, with an ever-growing pile of legacy format readers/converters.

@ewankeep I suspect jpg will last longer than gif; I kinda doubt the odt will survive but that's probably my bias against how ugly the internal design feels, rather than anything practical.

@eqe @ewankeep since the image file format hasn't been answered: gif could not become a standard due to patents. They have expired, but now all of its merits have been reimplemented in png. Anyway, the image file format of choice is and has been TIFF, baseline (no fancy extensions), uncompressed. JPEG2000 also has some friends in the archiving community.

@ewankeep Plain HTML, epub, multipage TIFFs, PostScript and good old plaintext.

ODT is technically stable but it can be extended with VBA macro scripts, which certainly aren't forever.

DJVU is also still an open standard but its compression is lossy so the documents look worse and the file size / broadband speed is less of a problem nowadays.

The quantum computers won't run old scripts and programs so the true long-long-term documents are plain pictures, texts or (sigh) pictures of text.

@oreolek @ewankeep "quantum computers won't run old scripts" wait what, that's not what quantum computers are about.
Also we run plenty of old programs (even though it's not great for the document preservation).

@charlag @ewankeep "old" in platforms, not in years. Backward compatibility is not a given when jumping technologies and generations, and there might be a standard war or two in the meantime.

DOSBox boasts a 91%¹ compatibility with 80386 assembler code from 1985. I'm not sure any new architecture (ARM? VLIW? Quantum whatever?) in 50 years will run/emulate our current programs without weird issues.

¹: https://www.dosbox.com/status.php?show_status=1

@oreolek @charlag @ewankeep i know people who migrated their COBOL software to emulators after their hardware went out of support, and then… out of life

@ewankeep I think CSV will (sadly) never die. And there will always be ambiguity about what separates the values, whether values are quoted or not, and which text encoding is used.
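
The ambiguity is easy to demonstrate with Python's standard `csv` module, which has to guess the separator from the data. The sample exports and the helper name below are invented for illustration, and the sniffer is a heuristic that can guess wrong on messier files.

```python
import csv
import io


def read_guessing_delimiter(text, candidates=";,\t|"):
    """Let csv.Sniffer pick a delimiter from the candidates, then parse with it."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


# Two made-up exports of the same table from different programs.
semicolon_export = "title;format;year\nDune;flac;1965\nEmma;wav;1815\nUbik;ogg;1969\n"
tab_export = semicolon_export.replace(";", "\t")

for sample in (semicolon_export, tab_export):
    delimiter, rows = read_guessing_delimiter(sample)
    print(repr(delimiter), rows[1])

# The encoding problem: the same bytes read under two different assumptions.
raw = b"caf\xc3\xa9"
print(raw.decode("utf-8"))    # café
print(raw.decode("latin-1"))  # cafÃ©
```

Nothing in the file itself says which separator or encoding was meant, so every reader ends up carrying guesswork like this.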

@erictapen I believe the command line spreadsheet sc-im uses CSV files. That's about as basic as you can get. Spreadsheets were among the first killer apps. So, it's something we'll need.

@ewankeep probably the most simple things possible like raw PCM, uncompressed bitmaps etc

@ewankeep what an excellent question! Thank you for asking it, I have no answers, but I will read the thread zealously!

@rysiek so many interesting answers! This thread really took off. Happy to get people thinking and talking to each other.

@ewankeep It's hard to beat TXT. But for images, the GIF/JPEG/PNG trinity is still holding up. Of those three, I think JPEG has survived the most dethroning attempts. Many have tried to replace it and failed. JPEG is probably immortal.

@ewankeep For audio, WAV may be even more future-proof than TXT files. There's no need to use a character encoding. If a future person finds a WAV file, if they know how signed integers work and if they know what a human voice sounds like, with just a little bit of effort and zero documentation, they will be able to decode a WAV. But of course, the files are pretty large.

@ewankeep (Using your example of an audiobook. Decoding sound recordings of things other than human voice would obviously depend on the decoder's experience with that kind of sound. But mostly just to get the sample rate right, which is the most difficult part. They might still figure out that it's a signal and try to play it as sound, they just might get the speed wrong.)
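
Here is a small sketch of the "wrong speed" problem using Python's standard `wave` module. It wraps the same raw samples in a WAV header at different guessed sample rates and reports how long each version would play. The helper names are made up, and the ten seconds of silence is a placeholder for real audio.

```python
import io
import wave


def wrap_as_wav(raw_pcm, guessed_rate_hz, sample_width_bytes=2, channels=1):
    """Put a WAV header on headerless PCM, using whatever sample rate we guessed."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width_bytes)
        w.setframerate(guessed_rate_hz)
        w.writeframes(raw_pcm)
    return buf.getvalue()


def playback_seconds(wav_bytes):
    """How long a player that trusts the header would take to play the file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        return r.getnframes() / r.getframerate()


# Placeholder recording: 10 seconds of 16-bit mono silence captured at 44.1 kHz.
true_rate_hz = 44100
raw = bytes(2 * true_rate_hz * 10)

for guess in (44100, 22050, 8000):
    seconds = playback_seconds(wrap_as_wav(raw, guess))
    print(f"guess {guess} Hz -> plays for {seconds:.3f} s, pitch x{guess / true_rate_hz:.2f}")
```

The samples are identical in every case; only the assumed rate changes, so a listener hears the same recording slowed down and lowered in pitch until the dial lands somewhere that sounds like a voice.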

@ewankeep I'd advocate for SQLite to be added to the list: it's open source with thorough documentation, & doesn't take that much code to implement!

From what I've read MP3 doesn't have a public standard (and hasn't been fully implemented, only "profiles" thereof) so I don't think it fits your qualifications.

Ogg could last but doesn't stand alone... Question for your setting: Does compression matter to them? How plentiful is bandwidth/storage?

@ewankeep Also worth remarking: HTML (excluding JavaScript) is very resilient! As long as we want to publish rich-text documents, some variation of it that's still compatible with today's could survive.

CSS *might* survive, but it'd probably evolve more to meet the cultural & technological needs of the time. Compatibility with today would be more of an issue.

Also for a near-future setting today's well-entrenched popularity of HTML vouches for it.

2/3

@ewankeep Seriously, that question of storage/bandwidth is important for answering your question. In a climate-apocalypse setting it's not unrealistic for us to have lots of storage to scavenge but limited compute.

In which case the pressure would be against compressed formats like (for audio) MP3 or FLAC, in favor of WAV or the 16-bit 44.1 kHz stereo audio found on CDs. Wouldn't have much CPU time to waste!
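
For a sense of scale, here is the storage side of that trade-off as plain arithmetic. The CD parameters come from the post above; the 128 kbit/s figure for a compressed file is just an illustrative bitrate I picked for comparison.

```python
def pcm_bytes(seconds, rate_hz=44_100, bits=16, channels=2):
    """Size of uncompressed PCM audio: one sample per channel per tick."""
    return seconds * rate_hz * (bits // 8) * channels


def compressed_bytes(seconds, kbit_per_s):
    """Size of a constant-bitrate compressed stream (bitrate is an assumption)."""
    return seconds * kbit_per_s * 1000 // 8


minute_cd = pcm_bytes(60)               # 10_584_000 bytes
hour_cd = pcm_bytes(3600)               # 635_040_000 bytes
minute_128k = compressed_bytes(60, 128) # 960_000 bytes

print(f"CD-quality PCM, 1 minute: {minute_cd / 1e6:.1f} MB")
print(f"CD-quality PCM, 1 hour:   {hour_cd / 1e6:.1f} MB")
print(f"128 kbit/s, 1 minute:     {minute_128k / 1e6:.2f} MB")
print(f"PCM is about {minute_cd / minute_128k:.1f}x larger")
```

So uncompressed audio costs roughly ten times the storage, but playing it back needs almost no arithmetic, which is the point when compute is the scarce resource.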

@ewankeep TXT, RTF, CSV, perhaps TIFF for multi-page b&w scans?

@ewankeep openoffice/libreoffice files are actually just zipped files of ascii text with markup.
I have written python programs to unzip them, replace placeholder text from a db, duplicate pages, and zip them up again as valid odt files or w/e, and sometimes convert them to pdf afterwards.
All just linux and opensource. Everything. :-)
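
A minimal sketch of that unzip, replace, re-zip loop, using only Python's standard library. The `{{NAME}}` style placeholders, the function names, and the tiny two-entry archive are all invented for the example; the demo archive is a stand-in and not a complete, valid ODT (a real one also carries a manifest, styles, and so on). The one format rule relied on here is that the `mimetype` entry comes first and is stored uncompressed.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_odt_template(odt_bytes, replacements):
    """Copy every entry of the archive, swapping placeholders inside content.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():  # keeps the original entry order
            data = src.read(info.filename)
            if info.filename == "content.xml":
                text = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    # escape() keeps characters like & and < from breaking the XML
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            # The mimetype entry stays uncompressed; everything else is deflated.
            method = zipfile.ZIP_STORED if info.filename == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
    return out.getvalue()


def make_demo_archive():
    """Stand-in for a template saved from a word processor (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("content.xml", "<text:p>Received from {{NAME}}: {{AMOUNT}}</text:p>",
                   compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


filled = fill_odt_template(make_demo_archive(), {"{{NAME}}": "A & B Co-op", "{{AMOUNT}}": "42.00"})
with zipfile.ZipFile(io.BytesIO(filled)) as z:
    print(z.namelist())
    print(z.read("content.xml").decode("utf-8"))
```

One caveat with real documents: a word processor may split a placeholder across several markup spans when it saves, so it is worth checking that the placeholder text survives intact in `content.xml` before relying on plain string replacement.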

@gemlog I didn't know that, that's cool you can unzip an odt and put it back together.

@ewankeep yes, I used that to make 'pretty looking' receipts 3-up. Next I opened the document in python and looped through it replacing names,dates and amounts, duplicated the page over and over until I reached the end of the query. I was reading pgsql with an sql-ledger db.
I know the idea has worked for others too, it's simple.

@ewankeep You don't need to write a program. Just change the extension to .zip and uncompress it. Then read it; you will see it is very readable and straightforward.

@gemlog @ewankeep Yeah this is actually exactly what I've done at my work, we have our own entirely custom client database setup and the main interface these days is a web one. We send a lot of letters to clients, either attached to emails as PDFs or in the actual mail, and a ton of them are just a few kinda stock types of letters, but with some logic juuuust complex enough to defeat any attempt to make them just mailmerge forms or such (ex. if this client is marked in our database as having this type of hardware, add a paragraph quoting them the current cost for their size of site for a third-party license, and then remove this *other* paragraph later---).

The easiest thing to do turned out to be just having my server-side code work out that stuff, substitute the generated plaintext markup strings into a template XML file detailing the core text of the document, and stuff that XML file back into a zip file containing the rest of the .odt document. Toss a few buttons onto the web UI et voila!

@gemlog @ewankeep The only really annoying thing is, since stuff like LibreOffice isn't actually overtly supporting people doing stuff like this, saving the same document with only minor tweaks can result in tons of arbitrary changes under the hood to stuff like what a certain style is called (ex. paragraph style "P28" is now "P32"), so once you embark down such a path any changes you do by reopening the resulting file with the GUI and tweaking things there, well that's basically in the realm of manual patching.

I've personally hated using any word processing software since Corel '97, everything has been downhill since then in my curmudgeonly opinion, so this was almost more a feature than a bug for me! I'd actually rather read the OpenDocument Format spec than fiddle with these damn newfangled word processors that don't even let you see the raw markup they're using within the interface itself :P

@keithzg
I don't mess around with the styles, merely replace strings.
I made rent one afternoon reversing some dead msdos wordpro file format. It was a book someone wanted to reprint and no one could read the file any more. Still only python, but a hexeditor first.
@ewankeep

@keithzg
Exactly! I'm glad to read of someone else doing the same (ish) thing - it's simple to do and very useful.
I did muck about with the builtin libreoffice db stuff, but.. I got kinda lost and frustrated frankly. I found it easier to just hack the files directly.
I think @kmj did something similar.
@ewankeep

@ewankeep Could a competent developer easily implement a viewer for a file format given the specification? If so, that's awesome. Text and text-based formats (csv, html, markdown, json, xpm, etc) are great. Uncompressed multimedia (wav, bmp, etc) are good.

Beyond that, the number of existing implementations starts to matter. The code for web standards and widely implemented open source formats is likely to survive forever: jpg, png, gif, mp3, ogg, zip, tar/gz, webm, h.264.

Stuff like odt? Meh

@ewankeep I know it's technically a trashy format, but basically the only files I have from the late 90s that are just as useful, and look just as good, today as they did then are PDFs. I consider the format basically eternal, at least as history has held out so far.

@ewankeep ironically .docx, because so many things that aren’t Word already support it anyways.
