Archaeology | Shaking my fist at the æther

OK, time to finally throw my hat into the political ring.

But I want to state for the record as a disclaimer that I do so not based on opinion, but merely based on the facts:

Tabs are better than spaces.

Now, quiet, you Python people. You’re just wrong. I know that your style guide, PEP 8, says you should use 4 spaces.

But it’s wrong.

Now, you’ve all heard the old unix greybeard argument about how your files will be 2% smaller if you switch to tabs instead of spaces, because you’ll use 1 tab character rather than 4 space characters. While this argument is correct, it has nothing to do with my argument (it’s just another benefit of using tabs, as far as I’m concerned). But a small saving in file size isn’t a reason to change how you do things.

The reason you should change how you do things and start using tabs instead of spaces is simple: it’s the correct answer. But that’s not actually my primary reason. My primary reason is that it’s better.

Now, this might sound arrogant or whatever, but allow me to explain what’s actually going on under the hood, and how you can configure your editor correctly and we can all live in peace and harmony and never worry about this whole indentation thing ever again.

A lengthy treatise about the history of text (and how it’s indented)

A long long time ago – even before Nirvana – there were mechanical typewriters. That’s where the tab key comes from, since our computer systems were originally used with teletypes, which were based on typewriters.

But typewriters didn’t just have a tab key – they also had tab stops – a bar along the back of the typewriter with several movable latches which allowed you to set the tabs at any position you like. The behaviour of the tab key and tab stops in a WYSIWYG word processor emulates this pretty faithfully (though it is a superset of typewriter functionality, e.g typewriters had a limited number of tab stops and afaik could only do left-aligned tab stops).

When we started using teletypes and terminals, we were originally using fixed-width, text-only monochrome displays (i.e the screen was typically 80 or sometimes 40 characters wide, and used a monospaced font). And back in the 60s IIRC the ASCII standard was developed as a descendant of the Baudot code used on telegraphs.

This standard defines a bunch of characters, and a bunch of control characters. If you’re familiar with ASCII or unicode at all you’ll recognise some of them. Some common examples:
character 32* – space
character 10 – linefeed
character 13 – carriage return
and character 65 – an uppercase “A”.

(* I tend to think in decimal, so these are decimal values. All ascii values here are in decimal for consistency)

If you’ve ever played with colours in your terminal prompt, you might also recognise escape as character 27.
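If you'd rather poke at these values than read them off a man page, here's a minimal Python sketch. The names `examples` and `codes` are just mine for illustration; `ord()` and `chr()` are the built-ins that convert between a character and its number.

```python
# Decimal code points for the characters mentioned above.
examples = {
    "space": " ",
    "linefeed": "\n",
    "carriage return": "\r",
    "uppercase A": "A",
    "escape": "\x1b",
}

codes = {name: ord(char) for name, char in examples.items()}

for name, code in codes.items():
    # chr() is the inverse of ord(): it turns the number back into the character.
    assert chr(code) == examples[name]
    print(f"{code:3d}  {name}")
```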

There are a bunch of these available, and you can see the full list with a simple ‘man ascii’ (assuming you have the relevant packages installed, apt-get install man-pages should do it on debian).

In this table, we see my beloved tab sitting at position 9. And you’ll also see one that you probably haven’t used before – character 11 – “vertical tab”.

All of these things are there for a reason, even though we almost never use some of them (like vertical tab) today.

There are a few intricacies of the ascii table which aren’t mentioned or immediately obvious from reading the man page I pointed you to. They’re a little more obvious if you look at a 4-column ascii table with the hex and binary values (<-- I'd encourage you to open that in a new window so you can look at it while reading this lengthy tome).

With this layout, it becomes more obvious that the first 32 ascii characters are in a special class that you probably already know about - these are the control characters. There is one other control character which is outside of this range and a special case - 127 / DEL.

Less obvious is that this pattern of categorising the ascii table into sets of 32 applies for all four columns. The ASCII table was intended to be broken up this way: we have four broad categories of characters here: control characters, symbols and numbers, uppercase, and lowercase.

Note another correspondence when we break the ascii table up in this way: the low five bits are the same for each letter in both uppercase and lowercase. We can think of the top two bits of the 7-bit code* as a "mode selector" to select between columns on this table, and the low five bits select one of the 32 rows, giving a particular character.

(* I'll loosely talk about each character as two 4-bit words below, upper and lower, to make things clearer. We're only talking about 7-bit "pure" ascii today, so the most significant bit of the upper word is always 0 for our purposes, and the bit that separates uppercase from lowercase, worth 32, is the third bit from the left of the upper word)
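To make the column/row idea concrete, here's a small sketch. It assumes the 4-column layout described above (32 rows per column), and `column_and_row` is a name I've made up for the example. The category labels in the comment are approximate, since a few symbols live in the letter columns.

```python
def column_and_row(char):
    """Split a 7-bit ASCII character into (column, row) on the 4-column table."""
    code = ord(char)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    # Top two bits: 0 = control, 1 = symbols/digits, 2 = mostly uppercase, 3 = mostly lowercase.
    column = code >> 5
    # Low five bits: the position within that column (0..31).
    row = code & 0b11111
    return column, row


for char in "Aa", "Zz":
    upper, lower = char
    # Same row, different column: only the "mode selector" bits differ.
    print(upper, column_and_row(upper), lower, column_and_row(lower))
```

Tab comes out as column 0, row 9, and space as column 1, row 0, which is the distinction the rest of this post leans on.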

This idea is modelled on an earlier code (baudot? something else? the history is long) and is in turn modelled on typewriters and how the shift key worked: On a mechanical typewriter, the shift key worked by physically shifting the printing mechanism or head (versions differed), and each "letter-stamper-thingy" on the typewriter had two characters - uppercase and lowercase (the names of which in turn come from a printing press operator's two cases of letters - uppercase tended to be used less often, so the operator would place it in the upper position, further away from his working area) - and depending on the position of the shift mechanism, selected between the two characters, giving each normal key two functions. Similarly, the number keys had symbols as their "uppercase character".

This design characteristic makes it pretty easy electronically to implement this "shift" mechanism for the letter keys on your keyboard without any special logic to handle upper/lowercase - each key has an encoded value for its row in the table, and depending on the state of the shift key we set or unset bit 3 of the upper word, i.e the bit worth 32 (it's a little more complex than this these days, e.g capslock).

And that's why teletypes were fairly common already by the time computers were invented - they're a lot simpler - the character table is designed to make it easy electronically.

But it doesn't stop at the keyboard, it's also easier to interpret on the decoding end: if a letter has bit 3 of the upper word set (the bit worth 32), you want to select a lowercase glyph. This is a very easy test that can be done with few logic gates, and in very few instructions on most(all?) computer processors.
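Here's roughly what that test looks like in code. This is a sketch, not how any particular keyboard controller does it; `CASE_BIT` and the function names are mine. Note that the bit test only means "lowercase" for letters, because digits and some symbols have the same bit set.

```python
CASE_BIT = 0x20  # decimal 32: the one bit that differs between "A" (65) and "a" (97)


def is_ascii_letter(char):
    return char.isascii() and char.isalpha()


def is_lowercase(char):
    # Only meaningful for letters: characters such as "1" also have this bit set.
    return is_ascii_letter(char) and bool(ord(char) & CASE_BIT)


def toggle_case(char):
    # Flipping the bit swaps the case of a letter; anything else is returned untouched.
    return chr(ord(char) ^ CASE_BIT) if is_ascii_letter(char) else char


print("".join(toggle_case(c) for c in "Tabs, not spaces"))
```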

So this meant that when computers came around, and we wanted a system to have them represent text and interact with keyboards, adopting this table made a lot of sense due to the slow speed of those early machines - efficiency was everything. And so ASCII was born - people took clever ideas of their predecessors and expanded on them.

You'll also notice that in this layout, the symbol characters between the uppercase and lowercase and at values >=123 make more sense – if you’ve ever looked at an ascii chart and wondered why e.g the symbols or letters weren’t all in one contiguous region, this is why!

(Today, we’re not technically using ASCII anymore – these days, all modern operating systems use unicode. But unicode takes this compatibility thing that ascii did even further – you may know that UTF-8, the most common unicode encoding, is byte-compatible with 7-bit ascii, so a pure ascii file (and most english text from other similar encodings, e.g iso-8859-1, too) is also a valid, identical UTF-8 file)
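You can check the byte-compatibility claim yourself. This sketch assumes the unicode encoding in question is UTF-8 (other unicode encodings such as UTF-16 are not byte-compatible with ascii); the variable names are only for illustration.

```python
text = "Tabs are better than spaces.\n"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

# Pure 7-bit text produces exactly the same bytes in all three encodings.
identical = as_ascii == as_utf8 == as_latin1

# The moment a non-ascii character shows up, the encodings diverge.
aether_utf8 = "æther".encode("utf-8")
aether_latin1 = "æther".encode("iso-8859-1")

print(identical, aether_utf8, aether_latin1)
```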

So far we’ve only covered columns 2-4, but a simple glance at our ascii table shows that column 1 is special. And you already know why: none of these are printable characters – except, debatably, tab.

You probably know about nonprintable characters – unicode means that most computers have lots and lots of them today. But you might not know the distinction between a printable / nonprintable character and a control character. And that’s what this column actually is – these are the control characters, not the nonprintable characters.

There is one other control character – DEL – which doesn’t live in this column. I’m not sure where its position at the end of the table originated and how that decision came about. But this is also relatively easy to test electronically – a 7-way AND gate on your 7 bits, and in code. Putting it at the end of the table like that makes it a relatively simple exception that you need to accommodate.
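In code, the whole "is this a control character?" question boils down to one range check plus the DEL exception. A minimal sketch follows; `is_control` and `control_codes` are names I've invented, and the last two lines lean on Python's own `str.isprintable()`, which happens to agree that space is printable and tab is not.

```python
DEL = 0b1111111  # all seven bits set, decimal 127


def is_control(code):
    """True for the 33 ascii control characters: column 1 plus DEL."""
    return code < 32 or code == DEL


control_codes = [code for code in range(128) if is_control(code)]

# Python's idea of "printable" lines up with this split for 7-bit ascii.
none_printable = all(not chr(code).isprintable() for code in control_codes)
rest_printable = all(chr(code).isprintable() for code in range(32, 127))

print(len(control_codes), none_printable, rest_printable)
```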

They’re control characters because this encoding was invented to provide all the functionality of all the various teletype machines out there, providing “one encoding to rule them all”, which should be able to work with any teletype, providing interoperability.

Teletype machines needed to have a way to signal to each other that this should be the end of the line, for example, and so you have a linefeed character. Today you might think of a linefeed as “just another character”, but the term “control character” isn’t just a pretty name – in its original intent, “linefeed” is not a character but an in-stream instruction for the receiving device, which means “move the physical roller which controls the vertical position of the physical paper in the actual real world one line down”. Presumably on some teletypes it also meant “…and return the physical IRL print head to the first column”, and on some it didn’t. In order to support all the features of all the teletype machines out there, a bunch of control characters were needed.

No, I have no idea what half of them do, either.

I do know about a couple that you may not have heard of. For instance, there’s the one that I call “EOF” – end of file, but which the ascii table lists as “End Of Transmission”, at position 4. Unix implements this as its “End Of File” character – this is what your terminal sends down the line when you press CTRL-D. It’s why you can press CTRL-D to log out of your terminal. It’s also why you can do

$ cat - > /tmp/foo (enter)
foo(enter)
bar(enter)
(ctrl-d)
$ cat /tmp/foo
foo
bar

to create a file which includes linefeeds from the unix prompt, using cat to read from stdin and then using ctrl-d to send the end-of-file character to tell the system that you’re done inputting data.

A more commonly known one, due to a decision by Microsoft to be contrarian, is the difference between a linefeed (“move 1 line down”) and a carriage return (“return the carriage (or cursor) back to column 1”). Technically Microsoft’s preference of doing both a carriage return and linefeed is perhaps more historically accurate, since in almost all cases you would want to do both of these things when the enter/return key is pressed. Unix, on the other hand, says that a linefeed implies a carriage return, and interprets carriage return as “*only* do a carriage return, not a linefeed”. That means that on unix CR allows you to “echo over” the same line again, which means you can draw bar charts in bash using echo -e "\r$barchart" in a loop.

I remember a time when *nix used LF, Windows used CR + LF, and macs used CR just to be totally goddamn annoying. Apple adopted LF along with unix with the advent of Mac OS X, so that’s not a thing anymore unless you’re into retrocomputing.
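If you ever have to deal with files from all three camps, here's a small sketch of how that looks in Python. `str.splitlines()` already treats LF, CR+LF and a lone CR as line boundaries; `to_unix` is a helper name I've made up for normalising everything to LF.

```python
sample = "unix\nwindows\r\nclassic mac\rlast"

# splitlines() recognises all three conventions as line boundaries.
lines = sample.splitlines()


def to_unix(text):
    """Normalise CR+LF and lone CR line endings to a plain LF."""
    # Replace the two-character sequence first, so a CR+LF never becomes two linefeeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(lines)
print(repr(to_unix(sample)))
```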

You may have seen the good old ^H^H^H^H^H^H joke, where a person is deleting their code. This is because the backspace character/key at position 8 was traditionally mapped to CTRL-H, which could render on some terminals visibly as ^H rather than a backspace depending on a ton of hardware variations and compatibility settings on the terminal you were sitting at and the terminal you were talking to.

CTRL-L clears the screen on *nix because it’s mapped to the form feed character at position 12. Likewise CTRL-C is mapped to character 3 (end of text, i’ve always called it ‘interrupt’). I believe that the dreaded CTRL-S and CTRL-Q to freeze/unfreeze output on your terminal are mapped to control characters, too, but I couldn’t tell you which ones.

There’s also a fun one which doesn’t appear to be mapped on my modern linux machine – CTRL-G, to ring the terminal bell.

These control key sequences exist because when people started using different terminals to talk to unix systems, they quickly found that not all terminals were the same. E.g not all of them had a ‘backspace’ or a ‘clear screen’ key, but all of them had some kind of “control” or “modifier” key, so the control sequences were added for people who didn’t have the corresponding key. To this day, I have a ‘compatibility’ tab in my terminal which allows me to tell the terminal to send a CTRL-H key sequence for backspace, amongst other things.
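The traditional mapping from a CTRL-letter combination to a control character is simple enough to sketch: keep only the low five bits of the letter, which drops it from the uppercase column into the control column. `ctrl` is a hypothetical helper name, and a modern terminal can remap any of these, so treat this as the historical convention rather than a guarantee about your setup.

```python
def ctrl(letter):
    """Return the ascii code traditionally sent for CTRL-<letter>."""
    # Masking with 0x1F keeps the row and forces the column to 0 (control characters).
    return ord(letter.upper()) & 0x1F


for letter, meaning in [
    ("C", "end of text / interrupt"),
    ("D", "end of transmission / EOF"),
    ("G", "bell"),
    ("H", "backspace"),
    ("I", "tab"),
    ("L", "form feed / clear screen"),
    ("Q", "device control 1 / resume output"),
    ("S", "device control 3 / pause output"),
]:
    print(f"CTRL-{letter} -> {ctrl(letter):2d}  {meaning}")
```

By this rule, CTRL-S and CTRL-Q come out as characters 19 and 17, and CTRL-I is my beloved tab at position 9.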

A short aside:

As I’ve demonstrated above, one of the pitfalls that we find ourselves running into on modern unix systems is that by the time you get to a terminal emulator in your gpu-accelerated, composited GUI, you’re running many layers of abstraction and compatibility deep: Your terminal is emulating and backwards-compatible with VT100 dumb-terminal hardware from perhaps the 1970s, patched to be able to support unicode, which is itself a backwards-compatible extension on top of the backwards-compatible extension of a previous code that is ascii, going all the way back to Baudot and the telegraph in the late 1800s. So, no, it’s not as straightforward as you’d expect to write code to say “move the cursor to position x,y” on a unix console.

This causes us a bunch of problems and limitations on modern desktop unix systems, perhaps more often than it helps the average user. If you read The UNIX-HATERS Handbook, you’ll find an entire chapter on how /dev/tty and the terminal emulator is the worst thing in the entire universe. This is generally acknowledged as one of unix’s “foibles”.

So why hasn’t anyone done anything about all that legacy stuff?

Because one of the joys and beauties of unix is the deeply-ingrained principles of backwards compatibility and portability that came to embody the unix philosophy over the course of decades. Which means that I can still (relatively) easily connect my modern terminal emulator up to an antique teletype and have it be compatible to a pretty decent extent.

This is an important quality of unix. It’s important to keep these open, compatible standards around for the purpose of the preservation of information. If we had moved from ascii to an incompatible standard, we would have had to convert every single document ever written in ascii into that new standard, or potentially lose the information as the old and incompatible ascii standard became more and more rare and unknown.

And if you search youtube, you can find people hooking modern systems up to antique teletypes. For my money that makes it all worth it.

But finally, Let’s talk about tab.

Note that space is up at position 32, in column 2 with the printable characters. I’ve seen space categorised as a nonprintable character, but this is the wrong way of thinking about it. A better way is to think of space as a fully black glyph on an oldschool fixed-width text terminal (regardless of whether or not it was actually implemented this way). You want a space character to erase any pre-existing character at that position on the screen, for example. And you want that “move on to the next screen column with each keypress, so that the user can type left-to-right” functionality that you get from making it a fully-black glyph.

For example, in bash:

echo -e "12345 \r     67890"

doesn’t give you the output:

1234567890

it gives you:

     67890

- the spaces erase the previously-printed characters.

Space is a printable character.

Tab is a control character.

I was tempted to write “which means ‘print 4 spaces’ on my system”, but I thought I’d do another bash example/test/demonstration, and I surprised even myself. On my system, it’s not “print 4 spaces” at all:

$ echo -e "1234567890\r\tABCDEF"
1234ABCDEF

I had expected this to echo

    ABCDEF

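But it turns out that the implementation of tab on my system is a bit more complicated than that. Instead it means “move the cursor forward to the next tab stop”, without erasing anything it passes over, and my tab stops are 4 columns apart. If I did: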

$ tabs -8
$ echo -e "1234567890\r\tABCDEF"

I’d get:

12345678ABCDEF

And if I do:

$ echo -e "\tsomething"
	something

That’s not 4 spaces that it’s printed at the start of the line – try selecting that text – it’s a single tab character, and its width is whatever your tab width is set to (since it’s being displayed on your machine right now).

I think this demonstrates pretty clearly that space is printable and tab is control.
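To pin down what the terminal seems to be doing in those experiments, here's a toy model of a single terminal line. It's a sketch under my own assumptions (evenly spaced tab stops, no line wrapping, only CR and tab handled as control characters) and not how any real terminal emulator is implemented; `render_line` is a made-up name.

```python
def render_line(data, tab_width=4):
    """Simulate printing `data` onto one line of a simple fixed-width terminal."""
    cells = []   # what is currently visible in each column
    cursor = 0   # zero-based column of the cursor
    for char in data:
        if char == "\r":
            # Control character: move the cursor, change nothing on screen.
            cursor = 0
        elif char == "\t":
            # Control character: jump to the next tab stop, erasing nothing.
            cursor = (cursor // tab_width + 1) * tab_width
        else:
            # Printable character (including space): overwrite the cell, then advance.
            while len(cells) <= cursor:
                cells.append(" ")
            cells[cursor] = char
            cursor += 1
    return "".join(cells)


print(repr(render_line("12345 \r     67890")))
print(repr(render_line("1234567890\r\tABCDEF", tab_width=4)))
print(repr(render_line("1234567890\r\tABCDEF", tab_width=8)))
```

Space goes through the same branch as every other printable character, while tab and carriage return only ever move the cursor.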

When fixed-width, monochrome teletypes and terminals were the norm (and for a long time they were the best way for humans to talk to computers – they beat the shit out of punchcards), and the ascii standard was adopted for use on a screen – with generally more capability than a teletype (a screen can easily delete characters / clear itself, and can emulate an infinite roll of paper by scrolling lines), indentation came up. This caused an issue at the time because they didn’t have WYSIWYG word processors with an infinite number of center-aligned tabs that could do everything your typewriter could do. Instead, they had this atomic system – there was no physical way on these devices to have a ‘half-character-width’ tab, like you could on a typewriter. And not a lot of memory or processor power for implementing fancy rules around kiiiiinda-trivial stuff like tabs. So the compromise that was reached was making a tab equal to a certain number of spaces.

But how many spaces? Some said 4, I think some said 8, and some said 2. This is what the ‘tab width’ setting of your text editor means. I’m sure others did more complex things with tab, like “indent to the same column as the next word from the line above”.

I’m not sure where the convention of “a tab equals 4 spaces” came from, but that’s certainly the one that became dominant at some point. Maybe it’s standardised somewhere, maybe it’s just a popular convention.

The point is, the way that tabs were handled used to differ at one point between different terminal hardware and/or settings. This is why tab settings are so seemingly-complicated in plaintext editors today – similarly to why ASCII has so many control characters, terminal emulators wanted to be able to emulate multiple types of terminal, so the tab settings had to be a superset of all of them.

The practical upshot of all this means that by correctly using your IDE’s “Tab width” setting, if you use tabs for indentation, you don’t need to have this argument about whether a tab should be 2 or 4 or 8 or 32 spaces: You simply set the tab width to your preference and tell your IDE to use tabs for indentation, and you’re set, and can see it indented however you like, and so can everybody else. We can all just use tabs correctly, and live in peace and tolerate each other’s preferences for indenting.
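Here's the "everybody sees their own preference" idea as a sketch. The file on disk contains only tab characters for indentation, and each viewer picks a width. I'm using Python's built-in `str.expandtabs()` to stand in for what an editor does when it draws the file; `view` and `source` are names made up for the example.

```python
# One file on disk, indented with tabs only.
source = "def something():\n\tif ready:\n\t\trun_code()\n"


def view(text, tab_width):
    """Show the text the way an editor with this tab width setting would draw it."""
    return text.expandtabs(tab_width)


for width in (2, 4, 8):
    print(f"--- tab width {width} ---")
    print(view(source, width), end="")
```

The bytes in `source` never change; only the rendering does, so nobody's preference ends up in version control.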

(The correct IDE settings are: Tab width: whatever you prefer; Use tabs for indentation, never spaces; aggressively and automatically convert groups of spaces *at the start of the line* into tabs. Auto-indent. If your editor can’t do these things, you should use a better one. Scite and Geany are good).
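If your editor won't do the "convert groups of spaces at the start of the line" part for you, it's not hard to sketch. This is a rough illustration, not what SciTE or Geany actually run: `spaces_to_tabs` is a hypothetical helper, it assumes you know the indent width the file was written with, and it deliberately leaves any leftover spaces and everything after the first non-whitespace character alone.

```python
import re


def spaces_to_tabs(line, tab_width=4):
    """Convert the leading whitespace of one line to tabs, touching nothing else."""
    leading = re.match(r"[ \t]*", line).group()
    rest = line[len(leading):]
    # Measure the existing indentation in screen columns, whatever mix it uses.
    columns = len(leading.expandtabs(tab_width))
    tabs, leftover = divmod(columns, tab_width)
    # Leftover spaces (less than one full indent) are kept as spaces.
    return "\t" * tabs + " " * leftover + rest


for line in ["        x = 1  # spaces after code stay", "\t    y = 2", "      z = 3"]:
    print(repr(spaces_to_tabs(line)))
```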

And there are valid preferences, too – I personally use 4-space indents on a desktop or laptop machine where characters are small and screen real estate is cheap, but if you’re coding on a small form-factor device with a small screen that can’t display long lines easily and large enough to be readable (like my openpandora), an indent of 2 characters is much more workable.

Another still valid though less-relevant-today reason to have a preference about tab width is something i only touched on very briefly earlier – some of these fixed-width displays were 40 columns, and some were 80 columns. The most common 40 column displays you would see were on the 8-bit microcomputers of the 80s, which tended to be built to hook up to TVs via an RF modulator, typically leading to insufficient resolution to do 80 columns and be readable. On a 40 column device there’s a good argument for a smaller indent for the same reason as I have on my openpandora – screen real estate.

So to start summing this all up and getting back to my original point, and although I’ve spent a million words describing the “why it’s more technically and semantically correct”, my #1 argument for tabs is not even based on any principle of it being more technically or semantically correct, or respecting the past, or anything like that.

I argue for tabs over spaces for indentation based on features: Done correctly, it removes the whole “How wide should an indent be?” question and allows users to decide based on their preference while still working together and having consistent code.

But I do also argue for it based on a nerdy “technical correctness” and “compliance with well-reasoned specifications” principles, too: In python, tab is even more explicitly semantically correct – in python we use indentation to signal a block of code to the interpreter. That’s the job of a control character, not of a printable character. That’s exactly what control characters are designed for. Those smart guys back in the 1960s or 1910s or whenever it was knew what they were doing when they put space in there with all the other printable characters.

However, note that when I say that you should be using tabs for indentation, I do not mean they should also be used for formatting – that does cause issues, as many advocates of spaces have pointed out in the past. I think maybe this is the most common pitfall that people run into which makes them prefer spaces. But understanding these tab settings is not hard, and there’s a benefit for all users, and it’s the correct option, and also it saves you some space, because one tab character is one quarter the size of 4 space characters!*

(* this old argument for tabs is actually not really true anymore a lot of the time: if you’re transferring this as plaintext over http, you’re probably using a modern web browser which supports http2 and/or gzip compression, and it’s quite likely you’re talking to a server that also supports it, so there’s a very good chance that you’re getting those 4 space characters gzipped, even if you’re not minifying your javascript, and in that case those 4 spaces will take up perhaps 10 or 11 bits of data vs the 8 bits a tab would use )
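You can measure that footnote rather than take my word for it. This is a rough sketch with a made-up, very repetitive sample file (`TEMPLATE` and `make_source` are my inventions), and it uses Python's `zlib` module, which implements the same DEFLATE compression that gzip uses. Real code will compress differently, so treat the numbers as an illustration only.

```python
import zlib

# A hypothetical source file: the same small function repeated many times.
TEMPLATE = (
    "def step_{n}(queue):\n"
    "{i}for item in queue:\n"
    "{i}{i}if item.ready:\n"
    "{i}{i}{i}item.run()\n"
)


def make_source(indent, functions=50):
    """Build the sample file using the given string for one level of indentation."""
    text = "".join(TEMPLATE.format(n=n, i=indent) for n in range(functions))
    return text.encode("ascii")


tabbed = make_source("\t")
spaced = make_source("    ")

# How many bytes tabs save before and after compression.
raw_saving = len(spaced) - len(tabbed)
compressed_saving = len(zlib.compress(spaced)) - len(zlib.compress(tabbed))

print("raw bytes saved by tabs:       ", raw_saving)
print("compressed bytes saved by tabs:", compressed_saving)
```

On a sample like this the uncompressed saving is three bytes per indent level, while the compressed difference is much smaller, which is why I treat file size as a side benefit and not the argument.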

So, for example:

#!/usr/bin/env python3

def something():
	# this line is indented. You should use a single tab character to indent it.
	#    but if I want to indent this line inside the comment, this is formatting, 
	#    and I shouldn't use tab for that.
	#
	#<-- tab
	#    <-- spaces      
	#
	# so, for example, to make an ascii-art table outlining the characters on this line:
	#    ----
	#
	# it would be:
	#  pos | character
	# -----------------
	#   1  | tab
	#   2  | hash
	#   3  | space
	#   4  | space
	#   5  | space
	#   6  | space
	#   7  | hyphen
	#   8  | hyphen
	#   9  | hyphen
	#   10 | hyphen        # note consistent column widths here, 10 is longer than 9, 
	#                      #   don't use tabs here between the hash and pipe characters

	run_code()

In the code world I've found that this formatting rule boils down to a pretty simple generalisation: left of the comment signifier (the hash character in python), that's indentation, right of it is formatting.

(yes, there are always weird edge cases, like heredocs, where formatting and indentation simply cannot be done well and unambiguously, but I've found this system to work pretty well. In these cases you should do what seems best and cleanest)
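A crude way to check a file against that rule is to look only at the whitespace before the first visible character on each line and complain if there's a space in it. This is a sketch, not a real linter: `indentation_problems` is a name I've made up, and it will also flag things like continuation lines that were deliberately aligned with spaces, which is one of those edge cases.

```python
def indentation_problems(source):
    """Return 1-based line numbers whose leading whitespace contains a space."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Everything before the first non-whitespace character is indentation.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if " " in indent:
            problems.append(number)
    return problems


sample = (
    "def something():\n"
    "\t# indented with a tab: fine\n"
    "\t#    spaces right of the hash are formatting: fine\n"
    "    run_code()  # indented with spaces: flagged\n"
    "\t  run_code()  # tab then spaces: flagged\n"
)

print(indentation_problems(sample))
```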

And now hopefully you know why tabs are correct and spaces are wrong. Please feel free to disagree and argue that the PEP says so, but just know in advance that if you do that you will be wrong.

More seriously, I would welcome discussion over some of the edge cases and pitfalls that people can run into with regard to this stuff. I find that a lot of the issues that people complain about with tabs also occur with spaces. It'd be cool to put together an exhaustive resource on the subject to document what is totally the empirically correct way to do it.

If you made it through these many thousand rambling words over something that many would consider trivial, thanks for reading.

AntiSols Blog
