The Association of Mad Scientists

A Demonstration of Pointer Arithmetic, using the clear syntax of the Crystal programming language.

When writing a program, one can pass around the address of a variable, rather than the variable's contents. Doing so gives one the freedom to modify the original value, or to look at what's in the memory around that value. This can be incredibly useful in languages that don't abstract these concepts away. However, the syntax those languages use makes pointer code difficult to read and write, and even more difficult to learn. Additionally, unless these operations are handled with extreme care, unexpected, strange, and downright dangerous behavior can result.

Crystal is an object-oriented language with syntax based primarily on Ruby, and it compiles to low-level code. It contains high-level abstractions for data types such as arrays, hash maps, strings, etc., but it also has libraries for dealing with "unsafe", low-level, direct memory access. Thanks to its Ruby roots, however, it has simple and intuitive syntax for dealing with these constructs. As such, it'll be much easier to demonstrate how pointers work using that syntax.

The syntax for pointers in Crystal is as follows:

variable = 1234 # some arbitrary variable
pointer = pointerof(variable) # acquire the memory address of the variable

puts pointer.value # => 1234

So to get a pointer, you call pointerof on the variable, and to dereference said pointer, you use the pointer's value attribute. This means you can write a function which modifies the original value, like so:

def change_value(ptr : Pointer(Int32))
  ptr.value += 789
end

change_value pointer
puts variable # => 2023

Which is kinda nifty, but really more dangerous than it is useful, especially when the thing being passed around is an integer, which is just as cheap to pass as a pointer (since a pointer is an integer underneath: the number of the memory location where the data is stored). Pointer arithmetic is the process of reaching the memory around a pointer by performing arithmetic on the memory address you hold.

A simple example of this is a "C string": a pointer to the first byte of a sequence of bytes, terminated by a zero byte, which decode to a series of characters. So if you have the C string char *string = "some text", you can reference the 'm' by asking for the value of the pointer at string + 2, that is, the second byte after the one referenced by the string variable. When you "add" to a pointer, the address doesn't simply move by the number you give it; it moves by that number multiplied by the size in bytes of the type the pointer points to. A char is one byte, so the two happen to coincide here. More on this towards the end.
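The same lookup can be tried from Crystal. This is a small sketch rather than anything from the eSpeak bindings: it relies on String#to_unsafe returning a Pointer(UInt8) to the string's bytes, and on Crystal keeping a zero byte after those bytes. The variable names are made up for illustration.

```crystal
text = "some text"
ptr = text.to_unsafe # Pointer(UInt8) to the first byte, 's'

first = ptr.value.chr       # the byte the pointer refers to
third = (ptr + 2).value.chr # two bytes further along
# the byte just past the last character is the terminating zero
terminator = (ptr + text.bytesize).value

puts first      # => s
puts third      # => m
puts terminator # => 0
```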

I was working on creating bindings to eSpeak for Crystal, when I saw this data type explained, which really made the term "pointer arithmetic" click for me.

the "languages" field consists of a list of (UTF8) language names for which this voice may be used, each language name in the list is terminated by a zero byte and is also preceded by a single byte which gives a "priority" number.  The list of languages is terminated by an additional zero byte.

So each "language" in this list is in a contiguous segment of memory organized like this:

|      1        |        2           |       3       |   4   | ... |last|
|---------------|--------------------|---------------|-------|-----|----|
| priority value| lang. first letter | second letter | third | ... |  0 |

A bunch of these memory regions are stuck up next to each other, then another 0 byte is added to the end to signal the end of the list, then you get the address of the first byte of the structure.
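To make the layout concrete, here is a sketch that builds such a region by hand and pokes at it with pointer arithmetic. The priorities and language names are invented sample data, not something eSpeak returned.

```crystal
# Hypothetical sample data: priority 5 for "en", priority 2 for "fr",
# then one extra zero byte that ends the whole list.
list = Bytes[5, 'e'.ord, 'n'.ord, 0, 2, 'f'.ord, 'r'.ord, 0, 0]
ptr = list.to_unsafe # Pointer(UInt8) to the first byte of the structure

priority = ptr.value                # column 1: the priority value
first_letter = (ptr + 1).value.chr  # column 2: first letter of the name
second_letter = (ptr + 2).value.chr # column 3: second letter
entry_end = (ptr + 3).value         # last column: the zero byte
next_priority = (ptr + 4).value     # first byte of the next entry

puts priority      # => 5
puts first_letter  # => e
puts second_letter # => n
puts entry_end     # => 0
puts next_priority # => 2
```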

Of course, any sensible language would simply create a mapping of priorities to languages, but that would be slightly less efficient, so the C code uses just some pointers and relies on the values being correct. Regardless, it's an excellent example to show how pointer arithmetic can work.

So, let's figure out how to parse this. First, let's define a parse_one_language function, which accepts a Pointer to a byte, and returns the priority, the language as a string, and the pointer to the first byte of the next language.

def parse_one_language(first_byte : Pointer(UInt8)) : Tuple(UInt8, String, Pointer(UInt8))
  # skip the priority byte, then walk forward until we hit the zero byte
  offset = 1
  loop do
    byte = first_byte + offset
    break if byte.value == 0
    offset += 1
  end
  # the name is the offset - 1 bytes between the priority byte and the zero byte
  utf_bytes = Bytes.new(first_byte + 1, offset - 1)
  return first_byte.value, String.new(utf_bytes), (first_byte + offset + 1)
end

This demonstrates actually finding the offset in memory of the zero byte which flags the end of our string, and using that to parse the bytes into a string. That's useful for demonstration purposes, but if you ever have a pointer to a zero-terminated run of bytes you need to parse as a string, you should pass the pointer straight to String.new instead, which will seek out the null byte itself.

def parse_one_language(first_byte : Pointer(UInt8)) : Tuple(UInt8, String, Pointer(UInt8))
  offset = 1
  until (first_byte + offset).value == 0
    # increment offset until we find a zero byte
    offset += 1
  end
  return first_byte.value, String.new(first_byte + 1), (first_byte + offset + 1)
end
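The two ways of calling String.new used so far can be compared side by side. This is a minimal sketch over invented bytes: the pointer form stops at the first zero byte, while the Bytes form copies exactly the slice it is handed.

```crystal
# Bytes for "hi", a zero byte, then "xyz", which should never be read.
raw = Bytes['h'.ord, 'i'.ord, 0, 'x'.ord, 'y'.ord, 'z'.ord]

# String.new(Pointer(UInt8)) reads up to, but not including, the first zero byte.
from_pointer = String.new(raw.to_unsafe)
# String.new(Bytes) copies exactly the bytes in the slice, with no scanning.
from_slice = String.new(raw[0, 2])

puts from_pointer          # => hi
puts from_slice            # => hi
puts from_pointer.bytesize # => 2
```

The pointer form is only safe when a zero byte is guaranteed to follow; without one, it keeps reading into whatever memory comes next.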

Now, note the third value that parse_one_language returns. first_byte + offset gets us the last byte of this entry (its terminating zero), so adding one more to that gives the first byte of the next entry, which gives us some convenience in constructing parse_language_list:

def parse_language_list(first_byte : Pointer(UInt8))
  languages = {} of String => Int32 # start off with an empty hash
  # a zero byte where a priority should be means we're all done
  # (checking first also copes with an empty list)
  until first_byte.value == 0
    # call out to the above function and get the three relevant values
    priority, language, first_byte = parse_one_language(first_byte)
    # store them in a more usable data structure
    languages[language] = priority.to_i
  end
  languages
end
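To try the whole thing without eSpeak installed, the list can be faked with a hand-built buffer. What follows is a sketch under a few assumptions: the sample data is invented, the parser is a condensed restatement (so the block runs on its own) rather than the two functions above verbatim, and it treats a zero in the priority position as the end of the list, as the format description says.

```crystal
# Condensed, hypothetical restatement of the parser so this block is standalone.
def parse_language_list(first_byte : Pointer(UInt8)) : Hash(String, Int32)
  languages = {} of String => Int32
  current = first_byte
  until current.value == 0 # a zero where a priority should be ends the list
    name = String.new(current + 1) # copies bytes up to the entry's zero byte
    languages[name] = current.value.to_i
    current += name.bytesize + 2 # step over priority byte, name, and zero byte
  end
  languages
end

# Invented sample data: priority 5 for "en", priority 2 for "fr".
list = Bytes[5, 'e'.ord, 'n'.ord, 0, 2, 'f'.ord, 'r'.ord, 0, 0]
result = parse_language_list(list.to_unsafe)

puts result # => {"en" => 5, "fr" => 2}
```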

It's really interesting, after spending most of my time working with "safe" languages, to see how direct memory access is implemented, and to see a few examples in a syntax that doesn't obscure what's going on.

Let's take one more example, to avoid confusion about pointer arithmetic on types other than char. A char in C is one byte, so in that case, when we take pointerof(some_var) + 3, we mean 3 bytes past the location of some_var. However, pointer arithmetic doesn't count in bytes; it counts in units of the size of the pointed-to type. So, let's store a series of 32-bit integers in contiguous space, then read them back and sum them.

# allocate room for ten 32-bit integers: 10 x 4 bytes per integer = 40 bytes
# (taking pointerof a single variable would only give us room for one,
# and writing past it would trample whatever memory comes next)
array = Pointer(Int32).malloc(10)
# store a random integer in each of the ten slots
10.times do |offset|
  (array + offset).value = rand(100)
end

def sum(array : Pointer(Int32)) : Int32
  output = 0
  10.times do |offset|
    output += (array + offset).value
  end
  output
end

puts sum array

In this example, we're working with pointers to 32-bit integers, each of which consumes 4 bytes of memory, not one like the char examples above. So every time we advance the offset by one, we advance four bytes further into the memory, not one.
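The stride can be checked directly by comparing raw addresses. This sketch uses Pointer#address, which returns the memory address as a plain integer; the variable names are just for illustration.

```crystal
numbers = Pointer(Int32).malloc(3) # room for three 32-bit integers
bytes = Pointer(UInt8).malloc(3)   # room for three single bytes

# adding 1 to each pointer, then measuring how far the address actually moved
int_stride = (numbers + 1).address - numbers.address
byte_stride = (bytes + 1).address - bytes.address

puts sizeof(Int32) # => 4
puts int_stride    # => 4
puts byte_stride   # => 1
```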