Paul Bone

The right amount of poison

Tue, 13 Feb 2024 00:00:00 +1100

Oh, you don’t want any poison in your porridge. But how about in your computer’s memory?

Papa Bear - too much poison

Papa Bear likes his chair hard, his porridge hot and his browser written in a memory safe language that helps engineers avoid memory bugs like buffer overruns and use after frees.

But even Papa Bear has to compromise: part of Firefox is written in a memory safe language and the rest is written in C++. When using C++ there are a variety of defenses programmers can use to help catch memory errors. One of those is called memory poisoning.

mozjemalloc, the memory allocator built into Firefox, poisons memory by calling memset(aPtr, 0xE5, size); before freeing it. Any memory containing the pattern 0xE5E5E5E5 is therefore very likely to be memory that’s already been freed. This has two and a half benefits. If some code were to free and then dereference some memory (a use-after-free bug), it would most likely cause the browser to crash, which is much better than a potentially exploitable bug allowing Goldilocks to steal Papa Bear’s banking credentials! The other benefit is that when Firefox does crash due to such a use-after-free, the presence of this pattern in the crash report allows engineers to see the type of error that occurred and hopefully fix the mistake.

Note that back in March 2023 we moved the poison operation outside of the arena lock’s critical section, which improved performance in some tests.
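To make the idea concrete, here is a toy model of free-time poisoning. It is only a sketch: the "heap" is a plain byte array, and the names POISON_BYTE, poisonOnFree and looksFreed are made up for illustration rather than taken from mozjemalloc.

```javascript
// Toy model of free-time poisoning. The "heap" is just a byte array; the
// names here (POISON_BYTE, poisonOnFree, looksFreed) are illustrative only.
const POISON_BYTE = 0xe5;

// Overwrite `size` bytes starting at `offset`, as memset(aPtr, 0xE5, size) would.
function poisonOnFree(heap, offset, size) {
  heap.fill(POISON_BYTE, offset, offset + size);
}

// Read a 32-bit word and report whether it matches the poison pattern.
function looksFreed(heap, offset) {
  const view = new DataView(heap.buffer, heap.byteOffset, heap.byteLength);
  return view.getUint32(offset) === 0xe5e5e5e5;
}

const heap = new Uint8Array(64);
heap.set([0xde, 0xad, 0xbe, 0xef], 16); // pretend an object lives at offset 16
const before = looksFreed(heap, 16);     // live data, not the poison pattern
poisonOnFree(heap, 16, 32);              // "free" the 32-byte object
const after = looksFreed(heap, 16);      // a stale read now sees the pattern
console.log({ before, after });
```

A stale pointer into the freed region reads 0xE5E5E5E5 instead of plausible-looking data, which is what makes the crash both likely and recognisable.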

Mama Bear - no poisoning

You probably figured out by now that I’m going to persist with this metaphor. Mama Bear likes her chair soft, her porridge cold (and congealed (yuck)), and her browser fast.

But how much faster is Mama Bear’s experience? This is the question that was raised recently when Randell Jesup was benchmarking various memory allocators in Firefox. He noted that while mozjemalloc performs poisoning, many of the other allocators do not and to compare the performance of the allocators more fairly they should either all perform poisoning or none of them should.

And so Randell noted that, depending on the test, Firefox could be between 0.5% and 4% faster with poisoning disabled.

Here are some results I collected. The "sp2" (Speedometer 2) and "sp3" (Speedometer 3) tests are browser benchmarks - larger numbers indicate better performance. The amazon and instagram tests are pageload tests measured in seconds with the ContentfulSpeedIndex metric - smaller numbers indicate better performance.

	sp2 (score)	sp3 (score)	amazon (sec)	instagram (sec)
Poison	178.84 ± 0.84	13.32 ± 1.03	243.2 ± 1.96	419.43 ± 1.04
No poisoning	179.42 ± 0.48	13.39 ± 0.31	237.55 ± 2.6	414.5 ± 0.8

The Speedometer figures are pretty close, and these are the best pageload figures (the others showed very little difference but nothing regressed; yes, I’m aware I’ve cherry-picked data).
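As a rough guide to what these numbers mean, the relative change between the two rows can be worked out from the means alone. This sketch ignores the ± spread entirely, so it should be read as back-of-the-envelope arithmetic and not as a statistical result; the helper name improvementPercent is made up for this example.

```javascript
// Relative change between the "Poison" and "No poisoning" rows of the table.
// Only the means are used; the ± spread is ignored, so treat these as rough.
const results = {
  sp2:       { poison: 178.84, noPoison: 179.42, higherIsBetter: true },
  sp3:       { poison: 13.32,  noPoison: 13.39,  higherIsBetter: true },
  amazon:    { poison: 243.2,  noPoison: 237.55, higherIsBetter: false },
  instagram: { poison: 419.43, noPoison: 414.5,  higherIsBetter: false },
};

// Percentage improvement from disabling poisoning; positive means "better".
function improvementPercent({ poison, noPoison, higherIsBetter }) {
  const delta = higherIsBetter ? noPoison - poison : poison - noPoison;
  return (delta / poison) * 100;
}

for (const [test, row] of Object.entries(results)) {
  console.log(`${test}: ${improvementPercent(row).toFixed(2)}%`);
}
```

On these means the improvement is roughly 0.3% and 0.5% for the two Speedometer tests, and roughly 2.3% and 1.2% for the amazon and instagram pageloads.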

This means that if it weren’t for the lack of security and debuggability, Mama Bear would have the right approach.

Baby Bear

Baby Bear loves a compromise, they want their computer to be safe from Goldilocks' hacking attempts but also love performance improvements.

One compromise may be to probabilistically poison memory some of the time, e.g. a roughly 5% chance of poisoning. That’s more complex and involves a memory write anyway to keep the "time until poison" counter updated. We didn’t investigate it. But it’s worth noting that it would be similar in spirit to the Probabilistic Heap Checker (PHC) that’s rolling out in Firefox or the similar GWP-ASan capability in Chrome.

Instead we tested "what if we poison only the first cache line of a memory cell". Andrew McCreight and Olli Pettay pointed out that Element, a common DOM structure, is 128 bytes long and poisoning it is useful to detect memory errors in DOM code, as a lot of DOM code will involve Element.

We tested poisoning the first 64, 128 and 256 bytes of each structure. We assume that management of cache and writing cache lines back to RAM is going to be the dominant cost. Therefore we round up our writes to the next cache line boundary.

For example, on a computer with 64-byte cache lines, suppose a 96-byte object is allocated so that the first 32 bytes are in one cache line while the next 64 bytes are in another. Our 64-byte write would cover two halves of different cache lines. In this case we will poison all 96 bytes because doing so writes to the same number of cache lines as the original 64-byte write.
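The rounding can be sketched as a small piece of arithmetic. This is my reading of the rule described above, not the mozjemalloc code: addresses are plain numbers, and CACHE_LINE and poisonLength are names invented for the example.

```javascript
// Sketch of "poison at least `budget` bytes, rounded up to a cache line".
// Addresses are plain numbers; CACHE_LINE and the function name are assumptions.
const CACHE_LINE = 64;

function poisonLength(address, objectSize, budget) {
  const objectEnd = address + objectSize;
  // End of the requested write, before rounding.
  const wantedEnd = address + Math.min(budget, objectSize);
  // Round up to the next cache-line boundary (no-op if already aligned).
  const roundedEnd = Math.ceil(wantedEnd / CACHE_LINE) * CACHE_LINE;
  // Never write past the end of the object itself.
  return Math.min(roundedEnd, objectEnd) - address;
}

// The 96-byte object from the example: 32 bytes in one line, 64 in the next.
const example = poisonLength(32, 96, 64);
// An aligned 256-byte object with a 64-byte budget.
const aligned = poisonLength(0, 256, 64);
console.log({ example, aligned });
```

For the misaligned 96-byte object the whole object is poisoned, while the aligned 256-byte object only has its first 64 bytes written.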

Let’s add these options to our table of results.

	sp2 (score)	sp3 (score)	amazon (sec)	instagram (sec)
Poison	178.84 ± 0.84	13.32 ± 1.03	243.20 ± 1.96	419.43 ± 1.04
Poison 256	179.50 ± 0.55	13.35 ± 0.33	240.47 ± 2.82	415.28 ± 1.30
Poison 128	179.19 ± 0.43	13.35 ± 0.59	241.62 ± 3.05	414.95 ± 1.15
Poison 64	179.09 ± 0.87	13.33 ± 0.83	242.13 ± 2.56	414.11 ± 0.91
No poisoning	179.42 ± 0.48	13.39 ± 0.31	237.55 ± 2.60	414.5 ± 0.8

As above, sp2 and sp3 are scores - bigger numbers are better - while amazon and instagram are page load tests where smaller numbers are better.

As expected the partial poisoning results fall between full and no poisoning. But what’s a little bit surprising is that in some tests (sp2 and amazon) poisoning a larger amount of memory made things faster. This could be because the memset() routine or the hardware itself is able to optimise larger writes more effectively. That said it’s important to acknowledge that the standard deviation is fairly high and doing the right statistical analysis is beyond this blog post.

Just right

Since poisoning more memory isn’t much slower, and in some cases is faster, than poisoning a little memory, we might as well choose to poison 256 bytes. That comfortably covers the Element object and most others, and for the rest it likely covers many of their most-often accessed fields. We’re confident that this is enough to help us catch many errors that can be caught with poisoning, while also performing well enough, especially for the pageload tests where it is closer to the performance available with poisoning disabled. We think that Baby Bear would agree, it is Just Right.

It gets better

With the Probabilistic Heap Checker (PHC) rolling out soon we will have an even greater ability to catch information related to memory errors. I’ll be writing about this in the future.

Why Papa Bear is safe and Mama Bear is fast?

In some ways it feels more natural to lean in to (negative) gender stereotypes where Papa Bear wants things fast and Mama Bear is the cautious one. I considered this. However, it’s easier to explain poisoning before explaining turning poisoning off, and the nursery tale describes Papa Bear’s preferences first, so that’s the order I introduced them here. Flipping the script on gender stereotypes was accidental.

Waiting for web content to do something in a Firefox mochitest

Fri, 25 Nov 2022 00:00:00 +1100

It’s not unusual for a Firefox test to have to wait for various things such as a tab loading. But recently I needed to write a test that loaded a content tab with a web worker and wait for that before observing the result in a different tab. I am writing this for my own reference in the future, and if it helps someone else, that’s extra good. But I don’t think it will be of much interest if you don’t work on Firefox as the problem I’m solving won’t be relevant and the APIs won’t be familiar.

I don’t think of myself as a JavaScript programmer - I’m learning what I need to know when I need to know it, but mainly to write tests. So I’m not sure I’ll pitch this article at any particular level of JS knowledge, sorry.

Web Workers

Web Workers provide web pages a way to execute long-running JavaScript tasks in a separate thread, where they won’t block the main event loop. They solve the same problem as threads, allowing a page to use concurrency. However, their programming model is more like processes because they don’t share state (global variables or even functions) and communicate by sending and receiving messages.

I realise this is a tangent but it’s a topic I like and you may have the same questions I did: So if workers are supposed to solve the same problems as threads do in other languages, why are they more like processes? Furthermore, at least in Firefox, each worker instantiates another copy of the JavaScript engine (the JSRuntime class) with its own instantiation of JIT, GC etc. Isn’t this fairly heavy just to add concurrency?

It is, but there are benefits:

I’m not certain, but I think this was the easiest way to retrofit concurrency to JavaScript (the language standard) without breaking backwards compatibility with existing web sites.
Message-passing concurrency makes the boundary between threads very clear. This makes it a simpler programming model, especially if you’re working on some code that is isolated from the concurrency happening elsewhere.
It worked for Erlang, although Erlang likely shares bytecode caches and some other systems. But not garbage collection.

Anyway, the point is that Web Workers are concurrent "process like" things that communicate through message-passing.

about:performance

Firefox has a number of about: pages, used for diagnostics and tweaking. about:config is probably the most infamous (if you touch those settings you can break your browser or make it insecure). about:support is interesting too; it contains diagnostic information about Firefox on your computer.

Today we’re looking at about:performance, which is useful when you are thinking "Firefox seems slow, I wonder why...". about:performance will show your busiest tabs and how much CPU time/power and memory they’re using.

Measuring memory usage can be tricky at the best of times (more on this in an upcoming article). We can’t afford to count every allocation since that is too slow for a page like about:performance, although about:memory comes closer to doing this. For about:performance we can ask major subsystems how much memory they’re using and rely on their counters. This isn’t accurate but it’s good enough.

I noticed two major things that weren’t counted:

Malloc memory used by JS objects was not counted.
Web workers were not counted.

I fixed them in Bug 1760920.

So I wanted to write a test that would verify that we are indeed counting memory belonging to web workers.

My web worker

To make it easier to see if we’re counting a component’s memory, it’s great if our test causes that component to use a lot of memory; then we can test for that.

Here’s a Web Worker that uses about 40MB of memory using an array with 4 million elements.

var big_array = [];
var n = 0;

onmessage = function(e) {
  var sum = 0;
  if (n == 0) {
    for (let i = 0; i < 4 * 1024 * 1024; i++) {
      big_array[i] = i * i;
    }
  } else {
    for (let i = 0; i < 4 * 1024 * 1024; i++) {
      sum += big_array[i];
      big_array[i] += 1;
    }
  }
  self.postMessage(`Iter: ${n}, sum: ${sum}`);
  n++;
};

It registers an onmessage event handler. When the page sends it a message it will execute the anonymous function. The first time this happens the function will create the array; each later time it will manipulate the array. Since the array is a global and is also captured by the handler I doubt the GC would free it. But I also want to stop an optimiser (now or in the future) from reducing the whole program to a large summation, or caching an answer, which is why the array is manipulated each time the event handler is called. It doesn’t matter that it’s ridiculous - it’s a test - just that it uses "enough" memory.

From the main page it can be started like this:

  var worker = new Worker("workers_memory_script.js");
  worker.postMessage(n);

But that’s not enough to make a working test.

The test

Our test needs to open this page in one tab, and in another tab look at about:performance and observe that the memory is being used. Opening and managing multiple tabs is standard fare for a browser test, but what we need is for our test to wait for the tab with the worker to be /ready/.

Waiting for a tab to be loaded is also very easy, which means that the tab will have executed worker.postMessage(n) by the time the test code checks. But that doesn’t mean that the worker has received the message.

So we need to make our test wait for the worker to start and complete one iteration (creating its array).

In the test we can add code such as:

  let tabContent = BrowserTestUtils.addTab(gBrowser, url);

  // Wait for the browser to load the tab.
  await BrowserTestUtils.browserLoaded(tabContent.linkedBrowser);

  // For some of these tests we have to wait for the test to consume some
  // computation or memory.
  await SpecialPowers.spawn(tabContent.linkedBrowser, [], async () => {
    await content.wrappedJSObject.waitForTestReady();
  });

The last three lines here are the interesting ones. SpecialPowers.spawn allows us to execute code in the context of the tab, in which we wait on a promise that the test is ready.

Now we need to add this promise to the page that owns the worker:

  var result = document.querySelector('#result');
  var worker = new Worker("workers_memory_script.js");
  var n = 1;

  var waitPromise = new Promise(ready => {
    worker.onmessage = function(event) {
      result.textContent = event.data;
      ready();

      // We seem to need to keep the worker doing something to keep the
      // memory usage up.
      setTimeout(() => {
        n++;
        worker.postMessage(n);
      }, 1000);
    };
  });

  worker.postMessage(n);

  window.waitForTestReady = async () => {
    await waitPromise;
  };

Starting at the bottom: for some reason I had to wrap the promise up in a function, I can’t remember why! I’m tempted to complain about JavaScript and its inconsistent rules here, but it could also be my limited understanding preventing me from getting it. What I do know is that this function must be in the window object so that the test code above can find it in wrappedJSObject.

The promise wrapped here (waitPromise; I could have picked a better name) is resolved when ready() is called, which happens after we receive the worker’s response. Finally we use setTimeout() to post another message to keep memory usage up. I don’t know why this was necessary either. Was the worker completely terminated without it?
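The shape of this "ready" handshake can be tried outside the browser. The sketch below swaps the real Worker for a hypothetical FakeWorker class so that it runs under Node; it shows only the promise plumbing and says nothing about memory usage or about why the wrapper function was needed in the real test.

```javascript
// FakeWorker is a hypothetical stand-in for a real Worker so this runs in Node.
class FakeWorker {
  constructor() {
    this.onmessage = null;
  }
  postMessage(n) {
    // Reply asynchronously, as a real worker would.
    setTimeout(() => {
      if (this.onmessage) this.onmessage({ data: `Iter: ${n}` });
    }, 0);
  }
}

const worker = new FakeWorker();
let lastReply = null;

// Resolved by the first reply; calling ready() again later would be harmless.
const waitPromise = new Promise(ready => {
  worker.onmessage = event => {
    lastReply = event.data;
    ready();
  };
});

worker.postMessage(1);

// The test side calls this and awaits the result.
const waitForTestReady = async () => {
  await waitPromise;
};

waitForTestReady().then(() => console.log(`ready after "${lastReply}"`));
```

The caller only learns that the first reply has arrived; anything the page does afterwards (such as the setTimeout loop in the real page) is independent of the promise.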

One more thing

Our test almost works. For whatever reason when the test accesses the right part of the about:performance page there’s no value for how much memory is being used. Waiting for a single update fixes this:

  if (!memCell.innerText) {
    info("There's no text yet, wait for an update");
    await new Promise(resolve => {
      let observer = new row.ownerDocument.ownerGlobal.MutationObserver(() => {
        observer.disconnect();
        resolve();
      });
      observer.observe(memCell, { childList: true });
    });
  }
  let text = memCell.innerText;

For the complete code for this test check out Bug 1760920 and toolkit/components/aboutperformance/tests/browser.

There’s things I don’t know

There’s three places here where I’ve said "it needs this code, I don’t know why". I hate programming like this, and I feel shameful writing it in a blog post and calling myself an engineer. I don’t want to spin it as a joke on JavaScript, or myself "lol, that’s programming! AMIRITE?!" There’s obviously some further subtleties I don’t know the rules for, and JavaScript does have some pretty inconsistent rules, throw in a browser, two tabs and a web worker and feeling like you don’t know how something works is relatable.

Do I wish I knew? Sure, I’m uncomfortable not knowing, but I’ve already spent enough time on this. But this is also why I wrote down what I do know. Next time I’ll be able to find this much and solve my problem quicker.

Running the AWSY benchmark in the Firefox profiler

Sat, 18 Sep 2021 00:00:00 +1000

The are we slim yet (AWSY) benchmark measures memory usage. Recently when I made a simple change to Firefox and expected it might save a bit of memory, it actually increased memory usage on the AWSY benchmark.

We have lots of tools to hunt down memory usage problems. But to see an almost "log" of when garbage collection and cycle collection occurs, the Firefox profiler is amazing.

I wanted to profile the AWSY benchmark to try and understand what was happening with GC scheduling. But it didn’t work out-of-the-box. This is one of those blog posts that I’m writing down so that next time this happens, to me or anyone else (although I am selfish), and I websearch for "AWSY and Firefox Profiler", this will be the number 1 result and help me (or someone else) out.

The normal instructions

First you need a build with profiling enabled. Put this in your mozconfig:

ac_add_options --enable-debug
ac_add_options --enable-debug-symbols
ac_add_options --enable-optimize
ac_add_options --enable-profiling

The instructions to get the profiler to run came from Ted Campbell. Thanks Ted.

Ted’s instructions disabled stack sampling; we didn’t care about that since the data we need comes from profile markers. I can also run a reduced awsy test because 10 entities is enough to create the problem.

export MOZ_PROFILER_STARTUP=1
export MOZ_PROFILER_SHUTDOWN=awsy-profile.json
export MOZ_PROFILER_STARTUP_FEATURES="nostacksampling"
./mach awsy-test --tp6 --headless --iterations 1 --entities 10

But it crashes due to Bug 1710408.

So I can’t use nostacksampling, which would have been nice to save some memory/disk space, never mind.

So I removed that option, then I get profiles that are too short. The profiler records into a circular buffer so if that buffer is too small it’ll discard the earlier information. In this case I want the earlier information because I think something at the beginning is the problem. So I need to add this to get a bigger buffer. The default is 4 million entries (32MB).

export MOZ_PROFILER_STARTUP_ENTRIES=$((200*1024*1024))

But now the profiles are too big and Firefox shutdown times out (over 70 seconds) so the marionette test driver kills Firefox before it can write out the profile.
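A rough size estimate shows why the profiles grew so much. This sketch assumes 8 bytes per buffer entry, which is simply what the quoted default (4 million entries, 32MB) works out to, and it treats "4 million" as 4 × 1024 × 1024; neither figure is taken from the profiler source.

```javascript
// Assumption: each profiler buffer entry is 8 bytes, which is what the quoted
// default (4 million entries, 32MB) works out to.
const BYTES_PER_ENTRY = 8;
const MiB = 1024 * 1024;

// Buffer size in MiB for a given number of entries.
function bufferMiB(entries) {
  return (entries * BYTES_PER_ENTRY) / MiB;
}

const defaultMiB = bufferMiB(4 * MiB);    // the default quoted above
const enlargedMiB = bufferMiB(200 * MiB); // MOZ_PROFILER_STARTUP_ENTRIES above
console.log(`default: ${defaultMiB} MiB, enlarged: ${enlargedMiB} MiB`);
```

Under that assumption the enlarged buffer is 50 times the default, about 1600 MiB, which fits with shutdown taking long enough to hit the timeout.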

The solution

So we hack testing/marionette/client/marionette_driver/marionette.py to replace shutdown_timeout with 300 in some places. Setting DEFAULT_SHUTDOWN_TIMEOUT and also self.shutdown_timeout to 300 will do. There’s probably a way to pass a parameter, but I didn’t find it yet. So after making that change and running ./mach build the invocation is now:

export MOZ_PROFILER_STARTUP=1
export MOZ_PROFILER_SHUTDOWN=awsy-profile.json
export MOZ_PROFILER_STARTUP_FEATURES=""
export MOZ_PROFILER_STARTUP_ENTRIES=$((200*1024*1024))
./mach awsy-test --tp6 --headless --iterations 1 --entities 10

And it writes an awsy-profile.json into the root directory of the project.

Hurray!

Follow-up

Whimboo says that setting toolkit.asyncshutdown.crash_timeout might help. But it may wait until after some stuff has been implemented:

a solution here should also be to extend the toolkit.asyncshutdown.crash_timeout value

oh wait. actually we haven’t fixed that yet, but only use it via geckodriver

— Whimboo

Avoiding large immediate values

Fri, 14 Sep 2018 00:00:00 +1000

We’re often told that we shouldn’t worry about the small details in optimisation, that either "premature optimisation is the root of all evil" or "the compiler is smarter than you". These things are true, in general. Which is why if you asked me about 10 years ago if I thought I would be using knowledge of machine code (not just assembly!) to improve a browser’s benchmark score by 2.5% I wouldn’t have believed you.

First off, I’m sorry (not sorry) for the gloating, and for what it’s worth the optimisation isn’t really that clever, and wasn’t even my idea. What I’m finding almost funny is that younger-me would not have believed that such low level details mattered this much.

Bump-pointer allocation

SpiderMonkey (Firefox’s JavaScript engine) separates its garbage collector into two areas, the nursery and the tenured heap. New objects are typically allocated first in the nursery, when the nursery is collected the object will be moved into the tenured heap if it is still alive. Collecting the nursery is faster than the whole heap since less data needs to be scanned, and most objects die when they are young. This is a fairly standard way to manage a garbage collector and is called generational garbage collection.

Allocating something in either heap should be fast, but since nursery allocation is more common it needs to be VERY fast. When JITing JavaScript code, allocation code is JITed right into the execution paths in each place it is needed.

I was working on a change to this code, I want to count the number of tenured and nursery allocations. And above all, I have to not add too much of a performance impact. That work is Bug 1473213 and isn’t actually the topic of this post, it’s just what drew my attention. (TL;DR: this work is Bug 1479360.)

The nursery fast-path looked like this, I’ve simplified it for easier reading, mostly by removing unnecessary things.

Register result(...), temp(...);
CompileZone* zone = GetJitContext()->realm->zone();
size_t totalSize = ...
void *ptrNurseryPosition = zone->addressOfNurseryPosition();
const void *ptrNurseryCurrentEnd = zone->addressOfNurseryCurrentEnd();

loadPtr(AbsoluteAddress(ptrNurseryPosition), result);
computeEffectiveAddress(Address(result, totalSize), temp);
branchPtr(Assembler::Below, AbsoluteAddress(ptrNurseryCurrentEnd), temp,
    fail);
storePtr(temp, AbsoluteAddress(ptrNurseryPosition));

That probably didn’t read right for most readers. What we’re looking at here is the code generator of the JIT compiler, this is not the allocation code itself, but the code that creates the machine code that does the allocation. I’ve broken it into two sections, the first five lines prepare some values and have absolutely zero runtime cost. The last five lines generate the code that does the bump pointer allocation. Function calls like loadPtr generate one or more machine code instructions:

loadPtr(AbsoluteAddress(ptrNurseryPosition), result): Read a pointer-sized value from memory at ptrNurseryPosition and store it in the register result. ptrNurseryPosition points to a pointer that points to the next free cell in the heap. So this places the pointer of the next free cell into the result register.
computeEffectiveAddress(Address(result, totalSize), temp): Use an lea or similar instruction to add totalSize (a displacement) to the contents of the result register, store the result of this addition into temp. After executing this temp will contain the pointer to the next free cell once we perform the current allocation.
branchPtr(..., AbsoluteAddress(ptrNurseryCurrentEnd), temp, fail): Compare the temp register’s contents against the contents of the memory at ptrNurseryCurrentEnd and if temp is higher, branch to the fail label. This compares the next value for the allocation pointer to the end of the heap, if the allocation would go beyond the end of the nursery then fail.
storePtr(temp, AbsoluteAddress(ptrNurseryPosition)): Store the new value for the next free cell (temp) into the memory at ptrNurseryPosition.
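The four steps above amount to ordinary bump-pointer allocation. Here is a toy model of the logic the generated machine code performs; the Nursery class and its field names are illustrative and are not SpiderMonkey code.

```javascript
// Toy bump-pointer nursery. Field names mirror the post (position, currentEnd),
// but the class itself is illustrative and not SpiderMonkey code.
class Nursery {
  constructor(start, size) {
    this.position = start;          // next free address
    this.currentEnd = start + size; // one past the last usable address
  }

  // Returns the address of the new cell, or null when the fast path fails.
  allocate(totalSize) {
    const result = this.position;    // loadPtr
    const temp = result + totalSize; // computeEffectiveAddress
    if (this.currentEnd < temp) {    // branchPtr(Below, end, temp, fail)
      return null;
    }
    this.position = temp;            // storePtr
    return result;
  }
}

const nursery = new Nursery(0x1000, 0x100);
const first = nursery.allocate(0x60);  // fits
const second = nursery.allocate(0x60); // fits
const third = nursery.allocate(0x60);  // would pass the end, so the fast path fails
console.log({ first, second, third });
```

In the real engine a failed fast path falls back to slower code (for example, collecting the nursery); here it is modelled as returning null.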

Unfortunately this isn’t as efficient as it could be.

Immediates and displacements

I’ve recently written about addressing in x86 where I wrote that instructions refer to operands and these operands may be registers, memory locations or immediate values. To recap, there are two main situations where some value can follow the instruction, it’s either as an immediate value or as a displacement for a memory operand.

Displacement: A displacement may be either 8 or 32 bits (on x86 running in 32 or 64 bit mode).
Immediate: An immediate value depends on the size of the operation, and may be 8, 16, 32 or 64 bits.

The point here, is that displacements cannot store a 64 bit value, so:

branchPtr(Assembler::Below, AbsoluteAddress(ptrNurseryCurrentEnd), temp,
    fail);

Cannot directly use a 64 bit displacement (ptrNurseryCurrentEnd) for its memory operand, and requires an extra instruction to first load this value into a scratch register from an immediate (which can be 64 bit) before doing the comparison. This operation will now need three instructions rather than two (compare and jump are already separate instructions).

Intel provides a special exception to these rules about displacements for move instructions. There are four special opcodes for move that allow it to work with a 64-bit moffset. So:

loadPtr(AbsoluteAddress(ptrNurseryPosition), result);

Can almost be represented. But these opcodes hard code result to the ax or eax registers, which is not suitable for a 64-bit value as this is. Therefore using 64-bit addresses also makes these loadPtr and storePtr operations use two instructions rather than one.

Here’s the disassembled code that this generates.

movabs $0x7ffff5d1b618,%r11
mov    (%r11),%rbx
lea    0x60(%rbx),%rbp
movabs $0x7ffff5d1b630,%r11
cmp    %rbp,(%r11)
jb     0x1f2f3ed1a351
movabs $0x7ffff5d1b618,%r11
mov    %rbp,(%r11)

This sequence, rather than being five instructions long, is now eight instructions long (and 49 bytes) and makes more use of a scratch register (which may impact instruction-level parallelism).

The instruction cache

Instructions aren’t the only cost. This code sequence contains four 64-bit addresses, that’s a total of 32 bytes in the instruction stream (including the target for the jump on failed allocations). That takes up room in the CPU’s caches and other resources in the processor front-end.

The front-end of a processor’s pipeline must fetch and decode instructions before they’re queued, scheduled, executed and retired. Processor front-ends have changed a lot, and there are multiple levels of caching and buffering. Let’s use the Intel Core Microarchitecture as an example; it’s new enough to be in common use, and things got more complex in the next microarchitecture due to having two different front-end pathways. The resource for this information is Intel’s optimisation reference manual.

Instructions are fetched 16 bytes at a time, and immediately following the fetch a pre-decode pass occurs: a fast calculation of instruction lengths. Once the processor knows the lengths (and boundaries) of the instructions within the 16 bytes, they’re written into a buffer (the instruction queue) six at a time. If there are more than six instructions in the 16-byte block, more cycles are used to pre-decode the remaining instructions. If fewer than six instructions were in the 16 bytes, or a read of less than 16 bytes occurred due to alignment or branching, then the full bandwidth of the pre-decode is not being utilised. If this happens often the instruction queue may starve.

The instruction queue is 18 instructions deep (but I think it’s shared by hyper-threading). Instructions are decoded from this queue four or five at a time by the four decoders. One of the decoders is special and can handle some pairs of instructions, turning them into a single operation.

Our instruction sequence above contains eight instructions in 49 bytes. Assuming alignment is in our favour, this will take four pre-decode steps, averaging 2 instructions per pre-decode cycle; less than the CPU is capable of. (I don’t know how this behaves when an instruction crosses the 16-byte boundary, but back-of-the-envelope reasoning tells me it’s not a problem.)
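The back-of-the-envelope figure can be written out as a tiny calculation. It is a deliberately crude model, assuming the sequence starts on a 16-byte boundary and that nothing else limits the pre-decoder; fetchStats is a name made up for this sketch.

```javascript
// Back-of-the-envelope model: code is fetched in aligned 16-byte blocks and
// we assume the sequence starts on a block boundary.
const FETCH_BYTES = 16;

function fetchStats(instructionCount, byteLength) {
  const fetches = Math.ceil(byteLength / FETCH_BYTES);
  return { fetches, instructionsPerFetch: instructionCount / fetches };
}

// Eight instructions in 49 bytes, as in the disassembly above.
const original = fetchStats(8, 49);
console.log(original);
```

The same helper can be applied to any other sequence by plugging in its instruction count and byte length.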

This low instruction density might not be a problem in many situations, such as when the instruction cache already contains plenty of instructions and this bubble does not affect overall throughput. However in a loop or when other things already affect the processor’s pipeline, it could definitely be an issue.

The change

My colleague sfink had left a comment in the nursery string allocation path where he attempted to experiment with this in the past. His solution was eventually removed because it was a little bit fiddly, but it was the inspiration for my eventual change.

The code (tidied up) now looks like:

CheckedInt<int32_t> endOffset = (CheckedInt<uintptr_t>(uintptr_t(curEndAddr)) -
    CheckedInt<uintptr_t>(uintptr_t(posAddr))).toChecked<int32_t>();
MOZ_ASSERT(endOffset.isValid(),
    "Position and end pointers must be nearby");

movePtr(ImmPtr(posAddr), temp);
loadPtr(Address(temp, 0), result);
addPtr(Imm32(totalSize), result);
branchPtr(Assembler::Below, Address(temp, endOffset.value()), result, fail);
storePtr(result, Address(temp, 0));
subPtr(Imm32(totalSize), result);

This loads a 64-bit address once and uses a relative address to describe the end of the nursery (the Address argument to the branchPtr call), then can re-use the original address when updating the current pointer (storePtr). We have to add the object size to result and subtract it later because we can’t easily get guaranteed access to another register with the way the code generator is written. So there are six operations in this sequence, let’s see the machine code:

movabs $0x7ffff5d1b618,%rbp
mov    0x0(%rbp),%rbx
add    $0x60,%rbx
cmp    %rbx,0x18(%rbp)
jb     0x164f300ea154
mov    %rbx,0x0(%rbp)
sub    $0x60,%rbx

Seven instructions long rather than eight, and 36 bytes rather than 49. This can be retrieved in three 16-byte transfers, rather than four. The instructions per fetch is now 2 1/3 rather than 2.
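The reason the relative form works is that the two nursery pointers sit close together in memory. As a sketch, using the two absolute addresses that appear in the first disassembly (and assuming they are the position and current-end pointers, in that order), the offset between them can be checked against the signed 32-bit range; fitsInt32 is a helper invented for this example.

```javascript
// Addresses taken from the first disassembly above; BigInt keeps them exact.
const posAddr = 0x7ffff5d1b618n;    // assumed: nursery position pointer
const curEndAddr = 0x7ffff5d1b630n; // assumed: nursery current-end pointer

const INT32_MIN = -(2n ** 31n);
const INT32_MAX = 2n ** 31n - 1n;

// True when `value` can be encoded as a signed 32-bit displacement.
function fitsInt32(value) {
  return value >= INT32_MIN && value <= INT32_MAX;
}

const endOffset = curEndAddr - posAddr;
console.log(fitsInt32(posAddr));   // the absolute address is too large
console.log(fitsInt32(endOffset)); // the relative offset is small enough
console.log(`0x${endOffset.toString(16)}`);
```

The offset comes out as 0x18, which matches the 0x18(%rbp) operand of the cmp instruction in the second disassembly.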

Results

It doesn’t look like a huge improvement, seven instructions compared with eight?! But now it uses one less 16-byte fetch which means one less cycle to fill the pipeline for these instructions, in the right loop that could make a huge difference. It did make Firefox perform about 2.5% faster on the Speedometer benchmark when tested on my laptop (Intel Core i7-6600U, Skylake). Sadly we didn’t see any noticeable difference in our performance testing infrastructure (arewefastyet or perfherder). This could be because our CI systems have different CPUs that behave differently with regard to instruction lengths/density.

My examples above were for the simpler Core microarchitecture, whereas my testing was on a Skylake CPU and will be quite different. Starting with Sandy Bridge there are two paths for code to take through the CPU front end, and which one is used depends on multiple conditions. To simplify it, on tight enough loops the CPU is able to cache decoded instructions and execute them out of a μop cache.

Macro-fusion

Another difference is that with an absolute address used in the cmp instruction it could behave differently with regard to macro-fusion (being fused with the jump to execute as a single operation). I’m not sure if large displacements affect macro-fusion.

Update 2018-09-18

I received some feedback from Robert O’Callahan, he wrote with three suggestions.

Allocate all JIT code and globals within a single 2GB region and use RIP-relative addressing (x86-64), so that addresses will not be larger than 32bits. This is a good idea and I considered this for the jump instruction in that sequence which still uses a 64 bit address (because the jump is created before the label, and so the address is written after, it must leave 64bits of space for now).
Using known bit patterns in the nursery address range we could test for overflow by checking the value of the bits, avoiding an extra memory read. This is a great idea but will require some other work first.
The final subtraction might be skippable if the caller can handle an address to the end of the structure and use negative offsets, eg by filling in slots in the object using negative offsets. I’m skeptical if this will provide much benefit compared to the effort required to avoid the subtraction, or probably at best delay it.

Good First Bugs

Thu, 23 Aug 2018 00:00:00 +1000

One great way (of many) to get started in software development, particularly in open source, is to find good first bugs. This is a class of software bugs (which should be called issues, since they’re not always bugs) that are easy to fix with little experience. It can also be a great way, once you have software development skills, to learn a new domain or set of tools. Many projects, even well funded ones, are very happy to receive community contributions; if nothing else, it’s one more way they can provide opportunities to the community.

At Mozilla we use bugzilla to track our bugs, and use the good first bug keyword to identify such bugs. You’re welcome to contribute patches for these bugs, and potentially have your work included in Firefox. You can also search by component, so the list of open good first bugs for the garbage collector is here and I’d be happy to help with any of these.

Good second bugs

As far as I know I created the concept of good second bugs. They’re not really second in the sense that you solve one good first bug then move on to a second bug.

To me this means that the contributor already has a fair amount of development experience but isn’t familiar with the domain. So let’s say you know C but you don’t know how to write a garbage collector or the theory behind it. A good second bug would be a bug filed against something like a garbage collector that does not require any GC knowledge, but probably does require C knowledge and roughly 5 years of development experience. It might take such a person a couple of hours to solve, rather than 5 minutes.

The intention is that it can help someone get into contributing to a particular project or learn some new type of programming. Particularly when those topics are generally regarded as deep or complex (but all topics are deep/complex, I don’t think GC is special, but that’s another topic).

I have created both a good first bug and a good second bug tag for Plasma (my side project), based on this idea. Until there are more contributors I’m not sure if this distinction is useful; it has not been tested. I’ve also created labels for skills that each issue may require, knowing that most people probably don’t know Mercury, which Plasma is written in.

Disassembling JIT Code in GDB

Tue, 07 Aug 2018 00:00:00 +1000

I’ve been making changes to the JIT in SpiderMonkey, and sometimes get a SEGFAULT, okay so open it in gdb, then this happens:

Thread 1 "js" received signal SIGSEGV, Segmentation fault.
0x0000129af35af5e9 in ?? ()

Not helpful, maybe there’s something in the stack?

(gdb) backtrace
#0  0x0000129af35af5e9 in  ()
#1  0x0000129af35b107d in  ()
#2  0xfff9800000000000 in  ()
#3  0xfff8800000000002 in  ()
#4  0xfff8800000000002 in  ()

Still not helpful, I’m reasonably confident the crash is in JITed code which has no debugging symbols or other info. So I don’t know what it’s actually executing when it crashed.

In case it’s not apparent, this is a short blog post where I can make notes of one way to get some more information when debugging JITed code.

First of all, those really large addresses (frames 2, 3 and 4) look suspicious. I’m not sure what causes that.

Now, I know the change I made to the JIT, so it’s likely that that’s the code that’s crashing, I just don’t know why. It would help to see what code is being executed:

(gdb) disassemble
No function contains program counter for selected frame.

What it’s trying to say is that the current program counter at this level in the backtrace does not correspond to any function in the C++ program (SpiderMonkey). So unless we did a call or goto of something invalid, we’re probably executing JITed code.

Let’s get more info:

(gdb) info registers
rax            0x7ffff54b30c0   140737308733632
rbx            0xe4e4e4e400000891       -1953184670468274031
rcx            0xc      12
rdx            0x7ffff54c1058   140737308790872
rsi            0xa      10
rdi            0x7ffff54c1040   140737308790848
rbp            0x7fffffff9438   0x7fffffff9438
rsp            0x7fffffff9418   0x7fffffff9418
r8             0x7fffffff9088   140737488326792
r9             0x8      8
r10            0x7fffffff9068   140737488326760
r11            0x7ffff5d2f128   140737317630248
r12            0x0      0
r13            0x0      0
r14            0x7ffff54a0040   140737308655680
r15            0x0      0
rip            0x129af35af5e9   0x129af35af5e9
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

These are the values in the CPU registers. The debugger uses the rip (program counter), rsp (stack pointer) and rbp (frame pointer) registers to know what it’s executing and to read the stack, including the calls that lead to this one. We can use this too: we’re going to use rip to figure out what’s being executed, and its current value is 0x129af35af5e9.

(gdb) dump memory code.raw 0x129af35af5e9 0x129af35af600

Then in a shell:

$ hexdump -C code.raw
00000000  83 03 01 c7 02 4b 00 00  00 e9 82 00 00 00 49 bb  |.....K........I.|
00000010  a8 ab d1 f5 ff 7f 00                              |.......|

I have asked gdb to write the contents of memory at the instruction pointer to a file named code.raw. Note that on x86-64 you need to write at least 15 bytes, as some instructions can be that long; I have 23 bytes.

I’d normally disassemble code using the objdump program:

$ objdump -d code.raw
objdump: code.raw: File format not recognised

In this case it needs extra clues about the raw data in this file. We tell it the file format, the machine "i386" and give the disassembler more information about the machine "x86-64".

$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   83 03 01                addl   $0x1,(%rbx)
   3:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
   9:   e9 82 00 00 00          jmpq   0x90
   e:   49                      rex.WB
   f:   bb a8 ab d1 f5          mov    $0xf5d1aba8,%ebx
  14:   ff                      (bad)
  15:   7f 00                   jg     0x17

Yay. I can see the instruction it crashed on. Adding the number 1 to the 32-bit value stored at the address pointed to by rbx. I’d like some more context, so I have to get the instructions that lead to this. Note that after the jmpq instruction nothing makes sense, that’s okay since that jump is always taken.

(gdb) dump memory code.raw 0x2ce07c3895e6 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   49 8b 1b                mov    (%r11),%rbx
   3:   83 03 01                addl   $0x1,(%rbx)
   6:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
   c:   e9 82 00 00 00          jmpq   0x93

When I go back three bytes I get lucky and find another valid instruction that also makes sense.

(gdb) dump memory code.raw 0x2ce07c3895e5 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   00 49 8b                add    %cl,-0x75(%rcx)
   3:   1b 83 03 01 c7 02       sbb    0x2c70103(%rbx),%eax
   9:   4b 00 00                rex.WXB add %al,(%r8)
   c:   00 e9                   add    %ch,%cl
   e:   82                      (bad)
   f:   00 00                   add    %al,(%rax)
        ...

Gibberish. Unfortunately I just have to guess which byte an instruction might begin on. Or go back byte-by-byte finding instructions that make sense. There was quite a bit of experimentation, and a lot more gibberish until I found:

(gdb) dump memory code.raw 0x2ce07c3895dd 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   bb 28 f1 d2 f5          mov    $0xf5d2f128,%ebx
   5:   ff                      (bad)
   6:   7f 00                   jg     0x8
   8:   00 49 8b                add    %cl,-0x75(%rcx)
   b:   1b 83 03 01 c7 02       sbb    0x2c70103(%rbx),%eax
  11:   4b 00 00                rex.WXB add %al,(%r8)
  14:   00 e9                   add    %ch,%cl
  16:   82                      (bad)
  17:   00 00                   add    %al,(%rax)
        ...

This is almost correct (except for all the gibberish). But at least it starts on an instruction that kind-of makes sense with a valid-looking memory address. But wait, that instruction uses ebx, a 32-bit register, which is not what I’m expecting since the code I’m JITing works with 64-bit memory addresses. And all that gibberish could be part of a memory address, it has bytes like 0xff and 0x7f in it!

I go back one more byte:

(gdb) dump memory code.raw 0x2ce07c3895dc 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   49 bb 28 f1 d2 f5 ff    movabs $0x7ffff5d2f128,%r11
   7:   7f 00 00
   a:   49 8b 1b                mov    (%r11),%rbx
   d:   83 03 01                addl   $0x1,(%rbx)
  10:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
  16:   e9 82 00 00 00          jmpq   0x9d

Got it, now that we have the extra byte at the beginning. That’s a long instruction (which I’ll talk more about in my next article). x86 has prefix bytes for some instructions which can override some things about the instruction. In this case 0x49 is saying this instruction operates on 64-bit data (well, 0x48 says that, and the +1 is part of the register number).
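To unpack that prefix byte: on x86-64 a REX prefix is any byte in the range 0x40 to 0x4f, and its low four bits are flags named W, R, X and B. The sketch below decodes them. The struct and function names are my own, hypothetical ones, not anything from GDB, objdump or SpiderMonkey.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an x86-64 REX prefix byte, bit layout 0100WRXB.
struct RexPrefix {
    bool w;  // 64-bit operand size
    bool r;  // extends the ModRM reg field
    bool x;  // extends the SIB index field
    bool b;  // extends the ModRM r/m, SIB base or opcode register field
};

// True if `byte` lies in the REX prefix range 0x40..0x4f.
inline bool isRexPrefix(uint8_t byte) {
    return (byte & 0xf0) == 0x40;
}

// Split a REX prefix byte into its four flag bits.
inline RexPrefix decodeRex(uint8_t byte) {
    return RexPrefix{(byte & 0x8) != 0, (byte & 0x4) != 0,
                     (byte & 0x2) != 0, (byte & 0x1) != 0};
}

// Register number for opcodes that keep the register in their low three
// bits (such as the 0xb8+r move-immediate family), widened by REX.B.
inline unsigned opcodeRegister(uint8_t rex, uint8_t opcode) {
    return (opcode & 0x7u) | (decodeRex(rex).b ? 8u : 0u);
}
```

For the bytes 49 bb, W and B are set. Opcode 0xbb on its own names register 3 (ebx/rbx), and the B bit turns that into register 11, which is r11. Dropping the prefix byte is exactly why the previous attempt decoded as a 32-bit mov into ebx.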

And there’s the bug (3rd line). I’m dereferencing this address, the one that I load into r11 once, and then again during the addl. I should only de-reference it once. The cause was that I misunderstood SpiderMonkey’s macro assembler’s mnemonics.

Update 2018-08-07

One response to this pointed out that I could have just used:

(gdb) disassemble 0x12345, +0x100

To disassemble a range of memory, and wouldn’t have had the "No function contains program counter for selected frame." error. They even suggested I could use something like:

(gdb) disassemble $rip-50, +0x100

I’ll definitely try these next time, though they might not be the exact syntax. I haven’t tested them.

Update 2018-08-18

Another tip is to use: x/20i $pc

That’s the whole command. x means that GDB should use the $pc as a memory location and not as a literal; /20i means "treat that memory location as containing instructions and show 20 of them"

You can also use this with display, like in display x/4i $pc so that every time you stepi, it will auto-print the next 4 instructions.

Static Assert Type in C++

Wed, 18 Jul 2018 00:00:00 +1000

Static type checking helps us be more confident that our software does what we think it does. But it can’t see everything, and this post was originally going to share a neat C++ feature that might have helped me be a little more confident about the code I’m writing. However just after I started writing this I found that it’s not necessary and there is (doh, I should have known) a nice C way to get the same check. However I still want to write it because it might be handy to remember this in the future.

I’m trying to add a counter to SpiderMonkey to count how many nursery and tenured allocations there are. I’ve implemented this for the non-JIT paths and now it’s time to implement the same for the JIT paths. Here’s the code for the JIT during a nursery allocation:

add32(Imm32(1), AbsoluteAddress(zone->addressOfNurseryAllocCount()));

This code isn’t what runs at runtime (kinda); it’s part of the JIT and is executed when we want to generate code that performs a nursery allocation. In other words, it doesn’t perform the allocation itself, the machine code it generates will.

This causes the JIT to write instruction(s) (usually 1) that perform a 32-bit add. They add the immediate value 1, to the value contained at the address provided by the call zone->addressOfNurseryAllocCount(). This returns a pointer to a uint32_t value. However the AbsoluteAddress constructor will cast this to a void pointer, before writing the 32bit add instruction(s) using it.

This means that if, in six months’ time, I decide that we need 64-bit counters, or want to save so much memory that 16-bit counters would be better, and the type of CompileZone::addressOfNurseryAllocCount() changes from uint32_t* to, say, uint64_t*, we’d have a problem (yes we would, popular platforms like x86 are little endian and will add the wrong bytes together).

So I wanted some kind of check here that if someone did make this change, they’d get a compiler error and change the add32 to an add64 for example. So I used:

static_assert(mozilla::IsSame<uint32_t*,
    decltype(zone->addressOfNurseryAllocCount())>::value,
    "JIT expects this to be a 32bit counter");
add32(Imm32(1), AbsoluteAddress(zone->addressOfNurseryAllocCount()));

Note that mozilla::IsSame is like std::is_same, it uses the template system to substitute in a different value for its value member depending on if the substituted types are the same. If they are, then the value is non-zero and the static_assert is accepted. But if the type of this function were to change then the assertion would fail, exactly what we want!

But there’s a simpler way. Just create a local variable of the desired type and assign to it first.

uint32_t *allocCount = zone->addressOfNurseryAllocCount();
add32(Imm32(1), AbsoluteAddress(allocCount));

The compilation will fail if the coercion implied by the assignment isn’t possible or safe. However, this won’t prevent all coercions (and hence my confusion), a uint32_t may be coerced to a uint64_t, but not a uint32_t* to a uint64_t*. So the simple solution is all we need here.
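Both checks can be tried outside of SpiderMonkey. The sketch below uses std::is_same in place of mozilla::IsSame, and the FakeZone and fakeAdd32 names are hypothetical stand-ins I made up for illustration; the real JIT emits machine code rather than doing the add directly.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the zone object, not SpiderMonkey's real class.
struct FakeZone {
    uint32_t nurseryAllocCount = 0;
    uint32_t* addressOfNurseryAllocCount() { return &nurseryAllocCount; }
};

// Hypothetical stand-in for add32: like AbsoluteAddress it only sees a
// void pointer, so it cannot notice if the pointee type changes.
inline void fakeAdd32(uint32_t imm, void* address) {
    *static_cast<uint32_t*>(address) += imm;
}

inline uint32_t countOneAllocation(FakeZone& zone) {
    // Approach 1: an explicit compile-time check on the returned type.
    static_assert(
        std::is_same<uint32_t*,
                     decltype(zone.addressOfNurseryAllocCount())>::value,
        "JIT expects this to be a 32bit counter");

    // Approach 2: a typed local. A uint64_t* will not convert to a
    // uint32_t* implicitly, so this line would stop compiling too.
    uint32_t* allocCount = zone.addressOfNurseryAllocCount();

    fakeAdd32(1, allocCount);
    return zone.nurseryAllocCount;
}
```

Changing the return type of addressOfNurseryAllocCount() to uint64_t* in this sketch makes both the static_assert and the local variable initialisation fail to compile, which is the behaviour described above.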

I’ve been through much of the existing JIT code today and made this type of improvement in many places (Bug 1476500).

I wonder if dependent types can ensure that a code generator generates (more) type correct code?

icecc and ccache - Compiling lots of C++ quickly

Fri, 04 Aug 2017 00:00:00 +1000

XKCD #303

Firefox is a big project and takes quite some time to compile. If you’re working on such a large project, you make a change, recompile, accidentally touch a header, recompile then lose a lot of time waiting and resort to checking social media or bouts of wheelie-chair jousting.

Within my home office, I’ve set up icecc and ccache on Linux Mint (similar to Ubuntu) on amd64 using GCC. I haven’t yet tried clang but probably will soon; instructions should be similar for other OSs and compilers, but YMMV. I’m going to give instructions for setting up both tools at the same time. They can be used independently, but if you want to compile large C/C++ projects often, you probably want both of them.

icecc (pronounced ice cream) is a tool to distribute C compiler jobs among a network of peers for parallel compilation. Think make -j but across multiple computers. You may have heard of distcc, it’s like that but smarter at scheduling jobs. Use icecc (or distcc if that’s your thing) if you have some spare (or used) computers to distribute compilation across. Some Mozillians (not sure I like that word yet, labels etc) set up icecc groups within the Mozilla offices. I’m told by someone who tried it that it’s not worth connecting to these from home. I’m also avoiding WiFi for this reason and more. Also, on connecting to other networks, be aware this could be a security issue: a peer could replace your code with something nasty and trick you into running it.

ccache is a C compiler cache. If you’re recompiling the same project often, ccache will remember the .o file generated previously and return it rather than running the compiler again.

sudo aptitude install ccache icecc

And on your workstation you can also install icecc-monitor

icecc uses two daemons, iceccd accepts jobs and runs them locally on each node, it connects to icecc-scheduler which manages the jobs for a group of machines and distributes them. I believe it’s supposed to work if you have multiple icecc-schedulers on your network, but I found that this would easily create two separate smaller clusters as my laptop came and went from the network. Instead it was simple to disable icecc-scheduler on all but one of the nodes.

sudo update-rc.d icecc-scheduler disable
sudo service icecc-scheduler stop

icemon idle

icecc-monitor is a GUI application that will let me visualise the cluster and its jobs. It was useful at this point to confirm that things were working, start it as part of your usual desktop environment. You should see the nodes of your cluster sitting idle. In the image you can see my two nodes, "fluorine" and "oxygen", I will be adding "neon" soon and have 16 cores/threads at my disposal.

If you’re not seeing this then I’ve found that:

You may have to disable your firewall, I use a firewall on my laptop when I’m out-and-about but find I have to turn it off to use icecc in my home office.
Restart the iceccd daemons to get them to connect to the scheduler.
Close and reopen icemon any time you restart the scheduler.

We could use icecc on its own, it’s as simple as adding /usr/lib/icecc/bin to your path. Instead of doing that we’ll add ccache. ccache likes a lot of hard disk space, 15-20GB is suitable if you’re working on Firefox, which usually uses about 5GB per build (reported by shu, I didn’t measure myself). I use btrfs with which I like to use snapshots, but there’s no point snapshotting my ccache, instead I created a new logical volume, used ext4 (remember to use noatime or relatime in your mount options (for any FS)) and mounted that at /mnt/ccache, depending on how your system is configured these steps could be quite different, or you might not use a separate filesystem at all. (I wish installers would let me name the volume group.)

Make the filesystem:

sudo lvcreate -L 20G -n ccache mint-vg
sudo mkfs.ext4 /dev/mint-vg/ccache

Put this in /etc/fstab:

/dev/mapper/mint--vg-ccache /mnt/ccache ext4    errors=remount-ro,noatime 0 2

Mount the filesystem and make one directory per user in it, that’s probably just one directory:

sudo mkdir /mnt/ccache
sudo mount /mnt/ccache
sudo mkdir /mnt/ccache/paul
sudo chown paul:paul /mnt/ccache/paul

And put this in your user’s ~/.ccache/ccache.conf, set the size here, and the filesystem size appropriately. Most filesystems run more smoothly with some free space, your SSD may be happier with some free space too.

max_size = 17G
cache_dir = /mnt/ccache/paul

One more thing, tell ccache to use icecc to run the compiler. I put this in my ~/.bashrc.

export CCACHE_PREFIX=icecc

Time to test it out. I’m not sure how this works generally, but for SpiderMonkey (JS shell only) builds you simply add --with-ccache to your ./configure arguments, then build with make -j12 (since I have 12 cores/threads in my two machines). For Firefox itself a build is normally configured by placing a mozconfig file in the project root directory. Add to that file:

ac_add_options --with-ccache=/usr/bin/ccache
mk_add_options MOZ_MAKE_FLAGS="-j12"

I haven’t measured the effects of using either ccache or icecc, but I’ve definitely noticed that ccache can speed up repeated builds. I also suspect that some parallel slackness (adding more tasks than there are cores) could help speed things up to cover some latency introduced by ccache.

The icecc-monitor program has a number of different views. The image above was "Star view". I think my favorite is "Gantt view" (below).

icemon running

Update 2017-08-17

Most of ccache’s options can be controlled by either configuration options in ~/.ccache/ccache.conf or by environment variables. Therefore I have removed CCACHE_PREFIX from my ~/.bashrc file and instead added it to ccache.conf.

I have also learnt that when -g (or similar) is on the command line ccache will hash the directory name (I think the working directory) and incorporate that into its cache. This ensures that any path references in debugging symbols resolve correctly and don’t mislead you during a debugging session. That is a good idea, but if most of your builds use -g and you use multiple workspaces for the same projects it can lead to more cache misses. If you don’t use a debugger often, and promise to turn directory hashing back on when you do (or accept being confused by references to different source files), then it can be disabled with the hash_dir option; the CCACHE_HASHDIR and CCACHE_NOHASHDIR environment variables control the same setting.

Now my ccache.conf looks like

max_size = 17G
cache_dir = /mnt/ccache/paul
prefix_command=icecc

# Might get confusing for debugging
hash_dir = false