Hexer – A hex editor

Posted on 2015-09-27 by petert

Way back in July I was doing some simple binary data file reverse engineering, and got annoyed at the fact that I couldn’t find a free hex editor for Windows which offered all of the following:

Seeing all interpretations of data (e.g. various integer types, floats and text) while hovering over some binary data.
Searching data of a given type (e.g. uint32) quickly and conveniently in different numerical formats (e.g. decimal or hex).
Marking identified pieces of data with their type and some notes.

I thought “it can’t be that hard now, can it?” and set out to create my own. After some deliberation I chose C#/WinForms to implement it, simply because the tool support is second to none and I didn’t want to waste more time on UI work than strictly necessary.

After a few hours of work back then, and a few more today, it has turned into quite a usable (but far from complete or user-friendly) program. I called it Hexer, which is both appropriate in English and also means “Warlock” (or even “Witcher”, literally) in German.

It has all the features which I was missing:

The pane on the left shows various interpretations of selected and hovered-over data.
You can easily search for numerical values of different types, written in any C-style literal format (e.g. “161” searches for the decimal number 161, “0xFF” searches for decimal 255 and “010” searches for decimal 8).
As seen in the screenshot, you can mark ranges of data with some data type, and see the in-line interpretation of that data in the file. These markers can be stored and loaded independently of the files they apply to.
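The search input described in the second point can be approximated in a few lines. What follows is a minimal Ruby sketch, not Hexer’s C# code: `parse_c_literal` and `find_uint32` are hypothetical names, and the searched type is assumed to be a little-endian uint32. Ruby’s built-in `Kernel#Integer` already accepts the C-style literal forms from the examples above.

```ruby
# Minimal sketch (hypothetical names, not Hexer's implementation):
# parse a C-style integer literal, then list every offset at which the
# value occurs as a little-endian uint32 in a binary buffer.

def parse_c_literal(text)
  # Kernel#Integer accepts "161" (decimal), "0xFF" (hex) and "010" (octal)
  # and raises ArgumentError for strings that are not valid literals.
  Integer(text.strip)
end

def find_uint32(data, value)
  needle = [value].pack("L<") # 4 bytes, little-endian, binary string
  offsets = []
  pos = data.index(needle)
  while pos
    offsets << pos
    pos = data.index(needle, pos + 1) # restart one byte later: overlaps count
  end
  offsets
end

buffer = [0x00, 0xFF, 0x00, 0x00, 0x00, 0xA1, 0x00, 0x00, 0x00].pack("C*")
p find_uint32(buffer, parse_c_literal("0xFF")) # => [1]
p find_uint32(buffer, parse_c_literal("161"))  # => [5]
```

Searching for another type would only need a different `pack` directive for the needle.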

I put the source up on GitHub here, and here’s an executable if you want to give it a try. (You’ll need the right version of the .NET runtime, of course.)

Of the entire implementation, I like this part best: a simple descriptive listing of all the data types, including their properties (such as name and size) as well as the ability to convert values of that type to and from raw binary and strings. It’s succinct, easy to extend (both with new data types and with new meta-information about them), and many of the UI elements are generated directly from that list.
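To illustrate the idea of one descriptive type list driving both conversions and the hover pane, here is a Ruby sketch. It is not the C# code from the repository: `DataType`, `DATA_TYPES` and `interpretations` are hypothetical, the data is assumed to be little-endian, and the byte conversions are delegated to `Array#pack` and `String#unpack1` directives.

```ruby
# Hypothetical sketch of a table-driven type list (not Hexer's C# code).
# Each entry knows its name, its size in bytes, and the pack/unpack
# directive used to convert between raw bytes and a Ruby value.
DataType = Struct.new(:name, :byte_size, :directive) do
  def from_bytes(bytes)
    bytes.unpack1(directive)
  end

  def to_bytes(value)
    [value].pack(directive)
  end

  # String -> raw bytes; floats via Float(), everything else via Integer().
  def parse(text)
    value = %w[float double].include?(name) ? Float(text) : Integer(text)
    to_bytes(value)
  end
end

DATA_TYPES = [
  DataType.new("int8",   1, "c"),
  DataType.new("uint8",  1, "C"),
  DataType.new("int16",  2, "s<"),
  DataType.new("uint16", 2, "S<"),
  DataType.new("int32",  4, "l<"),
  DataType.new("uint32", 4, "L<"),
  DataType.new("float",  4, "e"),
  DataType.new("double", 8, "E"),
].freeze

# All interpretations of the bytes starting at `offset`, as a hover pane
# might show them; types that would read past the end are skipped.
def interpretations(data, offset)
  DATA_TYPES.each_with_object({}) do |type, result|
    chunk = data.byteslice(offset, type.byte_size)
    next if chunk.nil? || chunk.bytesize < type.byte_size
    result[type.name] = type.from_bytes(chunk)
  end
end

bytes = [0xFF, 0xFF, 0xFF, 0xFF].pack("C*")
p interpretations(bytes, 0)
```

Adding a type is a single new line in `DATA_TYPES`, and UI elements such as a type drop-down could be generated from `DATA_TYPES.map(&:name)`.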

Hexer is far from fully featured; I put up a short list of TODOs in the GitHub README, but there’s a lot more which could (and should) be done.

Dark Dreams Don’t Die (D4) Alternative Launcher

Posted on 2015-06-07 by petert

The PC version of D4 was recently released (on Steam and elsewhere). The game uses UE3, so it works well on PC, and the mouse controls are also well done. The only problem is that the built-in launcher presents a fixed list of resolutions rather than all available ones, and also somehow messes up DPI scaling.

I wrote a new launcher in C# which doesn’t have these issues. It’s available here.

I also made the source code available on GitHub. Not because I’m very proud of it or because it’s particularly interesting (it’s just a simple WinForms C# app written in an hour or so), but because I hope that putting it in the public domain could help improve this state of affairs in the future.

Well, that’s all.

Graveyard of Forgotten Projects (Part 2)

Posted on 2015-02-22 by petert

This is a direct continuation of the first part. Let this partial quote from there serve as an introduction:

This series (hopefully) of blog posts will gather and shortly describe some previously unreleased and/or forgotten projects I’ve worked on over the past decade and a half, starting from 2000. This is by no means a complete listing, I’m still skipping over the vast majority of unfinished ideas and half-baked projects, but I’m trying to include everything which worked to an extent or has some interesting features.

The primary purpose of this is to serve as an archive for myself, because, as you will notice when I go through the list, I’ve already lost a lot of somewhat interesting stuff and would rather not have that happen again. Perhaps one thing or another might also be useful to someone else, but I obviously won’t troubleshoot code I wrote more than a decade ago and haven’t touched since then! And, just to make that clear, it’s also obviously not indicative of my current skills.

I stopped the previous post around 2003-2004, a timeframe of abortive attempts at creating a 3D multiplayer physics-driven space ball game.

Rudloid

I wasn’t sure where to slot this in, as I started this project in ~2002, but did a lot of work on it in 2005. I chose to go with something closer to the latter than the former. This is an Arkanoid clone written in Ruby. It used the RUDL SDL wrapper (another amazing example of my naming sense really), and initially did software rendering. Later on I switched to using OpenGL for the rendering for performance reasons, which is why the final 2005 iteration was called RudloidGL.

What is interesting about this project is that it’s actually written rather well, in some ways at least. For example, I implemented features like collision detection, sound effects and sprite animation as Ruby mixins, and they are re-used in every object where it makes sense. So the balls, paddle, blocks etc. use the same basic collision code with distinct response events, and animated blocks, paddles and balls use the exact same sprite animation code just with slightly different parameters.
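For readers unfamiliar with Ruby mixins, here is a small sketch of the pattern described above. It is not Rudloid’s code: `Collidable`, the `on_collision` hook and the axis-aligned bounding-box test are assumptions chosen for illustration. The module supplies the shared collision check, and each class that includes it defines its own response.

```ruby
# Illustrative sketch (not Rudloid's code): shared collision logic as a
# mixin, with each including class providing its own response hook.
module Collidable
  # Axis-aligned bounding-box overlap; includers must provide x, y, w, h.
  def collides_with?(other)
    x < other.x + other.w && other.x < x + w &&
      y < other.y + other.h && other.y < y + h
  end

  # Calls the includer's on_collision hook only when the boxes overlap.
  def check_collision(other)
    on_collision(other) if collides_with?(other)
  end
end

class Ball
  include Collidable
  attr_reader :x, :y, :w, :h, :events

  def initialize(x, y)
    @x, @y, @w, @h = x, y, 8, 8
    @events = []
  end

  def on_collision(other)
    @events << "bounced off #{other.class}"
  end
end

class Block
  include Collidable
  attr_reader :x, :y, :w, :h, :hits

  def initialize(x, y)
    @x, @y, @w, @h = x, y, 32, 16
    @hits = 0
  end

  def on_collision(_other)
    @hits += 1
  end
end

ball = Ball.new(10, 10)
block = Block.new(12, 12)
ball.check_collision(block)
block.check_collision(ball)
p ball.events # => ["bounced off Block"]
p block.hits  # => 1
```

A paddle class would include the same module and only define its own `on_collision`, which is the kind of reuse the paragraph describes.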

Of course, there are also some things where I was maybe going a bit too far with trying to be smart. For example, I used the “evil” library (yes, that’s actually its name) to implement different ball/paddle/etc. types and changing between them (when picking up a powerup pill) by actually changing the class type. It works, and in a way it’s elegant, but it’s also a bit wild.

Speaking of powerups, that’s another area where the code for this game is already a lot better than most of what I discussed in the first post in this series. I seem to have attained a decent understanding of the “once and only once” principle at this point, as well as of the value of building tools into the code base so that it can be used effectively. For example, this is the actual, entire definition code for all 12 powerup pills in the game:

gen_pill_class :SlowPill, :SLOWPILLFILE, -100, "Balls.instance.speedfactor *= 0.8"
gen_pill_class :SpeedPill, :SPEEDPILLFILE, 300, "Balls.instance.speedfactor += 0.15"

gen_pill_class :SplitPill, :SPLITPILLFILE, 0, "Balls.instance.split!"
gen_pill_class :TetrisPill, :TETRISPILLFILE, 400, "RudloidGame.instance.blocks.add_row!"

gen_pill_class :FireBallPill, :FIREBALLPILLFILE, 0, "Balls.instance.change_ball_type!(FireBall, 1000)"
gen_pill_class :PhantomBallPill, :PHANTOMBALLPILLFILE, 0, "Balls.instance.change_ball_type!(PhantomBall, 600)"
gen_pill_class :MoneyBallPill, :MONEYBALLPILLFILE, 0, "Balls.instance.change_ball_type!(MoneyBall, 1200)"
gen_pill_class :CrazyBallPill, :CRAZYBALLPILLFILE, 1000, "Balls.instance.change_ball_type!(CrazyBall, 1000)"

gen_pill_class :PaddlePlusPill, :PADDLEPLUSPILLFILE, -100, "RudloidGame.instance.paddle.changewidth(25)"
gen_pill_class :PaddleMinusPill, :PADDLEMINUSPILLFILE, 300, "RudloidGame.instance.paddle.changewidth(-25)"

gen_pill_class :StickyPaddlePill, :STICKYPADDLEPILLFILE, 200, "RudloidGame.instance.paddle.change_to(StickyPaddle)"
gen_pill_class :MirrorPaddlePill, :MIRRORPADDLEPILLFILE, 700, "RudloidGame.instance.paddle.change_to(MirrorPaddle)"

While eval’ing arbitrary strings would make my hair stand on end now, at least this code’s heart is in the right place, and I still appreciate how concise it is.
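The implementation of `gen_pill_class` itself isn’t shown here. As a sketch of how the same one-line-per-pill style could work without eval’ing strings, here is a hypothetical version that takes a block instead. The `Pill` base class, the `Game` stand-in and the `apply!` method are assumptions for illustration, not Rudloid’s actual classes.

```ruby
# Hypothetical block-based take on gen_pill_class (the original passes a
# string that gets eval'ed; its implementation is not shown in the post).
class Pill
  def apply!(_game)
    raise NotImplementedError, "generated pill classes override this"
  end
end

# Defines a Pill subclass named `name` whose apply! runs the given block.
def gen_pill_class(name, image_const, score, &effect)
  klass = Class.new(Pill) do
    define_method(:image_const) { image_const }
    define_method(:score) { score }
    define_method(:apply!) { |game| effect.call(game) }
  end
  Object.const_set(name, klass)
end

# Minimal stand-in for the game state, just enough for two pills.
Game = Struct.new(:speedfactor, :paddle_width)

gen_pill_class(:SlowPill, :SLOWPILLFILE, -100) { |g| g.speedfactor *= 0.8 }
gen_pill_class(:PaddlePlusPill, :PADDLEPLUSPILLFILE, -100) { |g| g.paddle_width += 25 }

game = Game.new(1.0, 100)
SlowPill.new.apply!(game)
PaddlePlusPill.new.apply!(game)
p game.paddle_width # => 125
```

The block keeps each effect as ordinary Ruby code that is parsed along with the rest of the file, while the definitions stay just as concise.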

The game also features a level editor, which nicely reuses most of the actual game rendering and related code. It also allows for undo/redo and copy/paste operations.

On the whole, this isn’t too bad really. What was bad (no, terrible) were the sound effects. I never spent much time on those, and it really, really shows. Check it out for yourself in this Shadowplay video I took of the game (after spending literally 2 hours finding all the dependencies in compatible versions and getting them to work together).

I will not upload the .zip for this, since I don’t think I am actually allowed to distribute some of the music and sound effects I used. If anyone cares a lot, post a comment and I’ll try to create a working package without those.

Wasser

The imaginatively named “Wasser” (German for “water”) is a very small program I hacked up in early 2005 to generate daily data, following a decent-looking curve, from monthly figures for the electricity production of hydroelectric power plants. Until I made this, that was apparently done manually in Excel (I shit you not).

There are two interesting things about this one:

It uses both Tk for the UI window and SDL (via RUDL) for the visualization window. That’s insane, but I guess I just used what I knew, and it works.
I came up with the method of how to distribute data across the days of each month on my own, without really knowing how one would actually solve such a thing correctly. Nonetheless, what I do in the end isn’t too different from what an explicit iterative solver would look like.
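Wasser’s actual method isn’t described in detail, so the following is only a sketch of one simple iterative scheme with the properties mentioned above, written in Ruby like Wasser itself: start every day at its month’s average, then repeatedly smooth the daily series and rescale each month so it still sums to its monthly total. The function name, the iteration count and the three-point smoothing window are assumptions.

```ruby
# Sketch of one possible approach (not Wasser's actual code): spread
# monthly totals over days so that the daily curve is smooth while each
# month still sums to its given total.
def distribute(monthly_totals, days_per_month, iterations: 50)
  months = monthly_totals.zip(days_per_month)

  # Initial guess: every day gets its month's average.
  daily = months.flat_map { |total, days| Array.new(days, total.to_f / days) }

  iterations.times do
    # 1) Smooth: each value becomes the mean of itself and its neighbours.
    daily = daily.each_index.map do |i|
      window = daily[[i - 1, 0].max..[i + 1, daily.size - 1].min]
      window.sum / window.size
    end

    # 2) Rescale each month so that its days add up to the monthly total again.
    start = 0
    months.each do |total, days|
      month_sum = daily[start, days].sum
      factor = month_sum.zero? ? 0.0 : total / month_sum
      (start...start + days).each { |i| daily[i] *= factor }
      start += days
    end
  end
  daily
end

daily = distribute([310.0, 560.0, 930.0], [31, 28, 31])
puts daily.each_slice(10).map { |chunk| chunk.map { |v| v.round(1) }.join(" ") }
```

Alternating a smoothing step with a constraint-restoring step is a crude form of iterative solving: the monthly sums hold exactly after every iteration, while the step changes at month boundaries are progressively softened.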

You can get it here, though I have no idea why you’d want to.

Ragex

Released (actually released!) at Christmas 2005, Ragex is a tiny generator for XHTML. I don’t need to write much about it here, because the web page is still online; I’m just including it for completeness’ sake. The best part about this one was when I got an email from someone actually using it, in 2008 or so, when I had almost forgotten having created it.

Crystalise

After my 3D experiments in 2004, and other game projects that never quite reached completion, I really wanted to actually finish a game. That’s why I came up with a very simple action/puzzle principle, and Crystalise was born (also check out the help file for a visual explanation of its gameplay).

Here’s a bullet point summary:

It’s made for PSP, the hottest gaming device of 2005
The vast majority of the game is written in Lua, with some of the heavy lifting (rendering, collision detection) in C. Pretty modern in that regard!
As far as I can tell, I completely lost the entire C source code for this project, as well as all the original (non-distribution) assets. A valuable lesson about backups, and even more so about the value of releasing the source. The Lua code at least is included in the distribution.
It really is a complete game, including stuff like a main menu, difficulty levels and high score tracking.
The levels (including boss and special levels) are procedurally generated, probably one reason why I actually managed to finish this game.

If you reach a certain level (which probably no one except me ever did), the game increases the PSP clock speed to 300 MHz in order to maintain a solid 60 FPS:

	if self.levelnum == Difficulty.clocklevel then
		System.setClock(300)
		Window:showText("Not Bad", "Impressed by your formidable prowess,\nthe Crysts had to request more power!\n\n(PSP clock speed increased)")
	end

I believe I invented the now commonly used paradigm for writing text with an analog gamepad in this game: selecting a direction with the analog stick and then selecting a character from that direction using a button.
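The direction-then-button scheme can be sketched as follows. This is a hypothetical Ruby illustration rather than Crystalise’s Lua code: it assumes eight 45-degree sectors, four characters per sector, a dead zone of 0.5, and that the face button index (0 to 3) picks the character within the selected sector.

```ruby
# Hypothetical sketch (not Crystalise's code): pick a group of characters
# by stick direction, then a character within that group by button.
GROUPS = %w[ABCD EFGH IJKL MNOP QRST UVWX YZ.! 0123].freeze
DEAD_ZONE = 0.5

# Maps stick deflection (x right, y up, each in -1.0..1.0) to one of eight
# 45-degree sectors: sector 0 is centred on "right", counting
# counter-clockwise. Returns nil while the stick rests in the dead zone.
def sector(x, y)
  return nil if Math.hypot(x, y) < DEAD_ZONE
  angle = Math.atan2(y, x) # -PI..PI, 0 = right
  (angle / (Math::PI / 4)).round % 8
end

# Character for the current stick position and pressed button (0..3).
def pick_char(x, y, button)
  s = sector(x, y)
  s && GROUPS[s][button]
end

p pick_char(0, 1, 2)  # stick up, third button   => "K"
p pick_char(-1, 0, 3) # stick left, fourth button => "T"
```

With eight directions and four buttons this covers 32 characters without ever moving a cursor across an on-screen keyboard.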

This is one of those projects where I’m really sad I lost (part of) the source. At least in this case most of the Lua is still available. And some of it is pretty neat, check e.g. circle.lua which (despite the name) keeps track of and handles updates to the current state of the level.

LuXr

This is a C#-based tool for launching programs (and other stuff) that I’m still quite proud of. I made and released it in early 2006, and the web page is still online. I just downloaded and ran the 54 kB package, and it still works well on my current Windows 7 installation.

The major reason I created this was that I was fed up with the pre-Vista Windows way of launching programs. Of course, Microsoft seemingly agreed, and the keyboard- and search-driven way of using the Start menu from Vista onwards pretty much made LuXr obsolete.

Still, there are a few things to like about this one:

The aesthetics for the startup effect were influenced by the OS in Serial Experiments Lain. Which was much more awesome of course.
It features a plugin system which dynamically loads plugins written in C#; these can have their own config dialogue pages. Plugins are addressed by adding a prefix to what you type, e.g. “g something” to google “something” with the Google plugin, or “e” to use the eval plugin.
I still prefer how this resolves substring matches compared to the Windows Start menu search. For example, if I type “vis 13” in the Start menu on my current PC, it finds no matches; LuXr finds “Visual Studio 2013”.
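Both behaviours from the list above, prefix-based plugin dispatch and token-wise substring matching, are easy to sketch. The following Ruby code is a hypothetical illustration, not LuXr’s C# source: the `PLUGINS` table, the `dispatch` and `matches?` names, and the rule that every whitespace-separated token must appear case-insensitively somewhere in the entry name are assumptions, chosen because they reproduce the “vis 13” example.

```ruby
# Hypothetical sketch (not LuXr's code).

# Plugins keyed by prefix; each receives the rest of the input.
PLUGINS = {
  "g" => ->(query) { "search the web for #{query}" },
  "e" => ->(expr) { "evaluate #{expr}" },
}.freeze

# True if every whitespace-separated token of the query occurs,
# case-insensitively, somewhere in the candidate name.
def matches?(query, name)
  haystack = name.downcase
  query.downcase.split.all? { |token| haystack.include?(token) }
end

# Routes "prefix rest" to a plugin if the first word is a known prefix,
# otherwise filters the program list with token matching.
def dispatch(input, programs)
  prefix, rest = input.split(" ", 2)
  plugin = PLUGINS[prefix]
  return plugin.call(rest) if plugin && rest
  programs.select { |name| matches?(input, name) }
end

programs = ["Visual Studio 2013", "Notepad++", "VLC media player"]
p dispatch("vis 13", programs)      # => ["Visual Studio 2013"]
p dispatch("g something", programs) # => "search the web for something"
```

Because the tokens are matched independently, “13” can hit the end of “2013” while “vis” hits the start of “Visual”.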

What’s not to like is that I believe I lost the source for this one as well (except for the plugins, which are part of the distribution package). It seems that with the years progressing I got better about actually releasing stuff from time to time, but worse about keeping backups.

That’s it for today, next time around we’ll get to actual 3D graphics programming. WHOOOHOO!

Wrapper_gen, a wrapper generator for COM interfaces

Posted on 2013-07-31 by petert

DSfix was based on a Direct3D9 wrapper, which was mostly taken from an existing code base and extended manually.

Recently, I needed to hook Direct3D9Ex, and came to the conclusion that the busy work of writing the initial wrapper is better left to a computer than to a human. Therefore, I wrote a Ruby script which takes the declaration of a COM interface from a Microsoft SDK header file and generates the C++ code for a wrapper class for it.

Here’s the script (wrapper_gen.rb); it’s rather tiny:

# parameter check
if(ARGV.size < 3 || ARGV.size > 4) 
    puts "Usage:   ruby wrapper_gen.rb INTERFACE_NAME IN_FILE_NAME OUT_NAME [LOG?]"
    puts "Examples: "
    puts "  - ruby wrapper_gen.rb IDirect3DTexture9 d3d9.h d3d9tex"
    puts "    produces d3d9tex.h and d3d9tex.cpp with no logging"
    puts "  - ruby wrapper_gen.rb IUnknown d3d9.h unknown true"
    puts "    produces unknown.h and unknown.cpp"
    puts "    and adds logging to generated wrapper methods"
    puts "wrapper_gen version 0.2"
    exit(0)
end

# parameters
interface_name = ARGV[0]
file_name = ARGV[1]
out_name = ARGV[2]
logging = ARGV.size > 3 && ARGV[3] != "false"

# logging spec
log_string_pre = 'SDLOG(20, "'
log_string_post = '\n");'

# regexps used for input file parsing
interface_exp = /DECLARE_INTERFACE_\(\s*#{interface_name}\s*,\s*(\w+)\s*\)\s*\{([^}]*)\s*};/m
method_exp1 = /STDMETHOD\s*\(\s*(\w+)\s*\)\(([^)]+)\)/
method_exp2 = /STDMETHOD_\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)\(([^)]+)\)/
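# a parameter counts as named if it ends in an identifier preceded by
# whitespace, '*' or '&' (so "D3DSURFACE_DESC *pDesc" is recognized too)
param_exp = /^(.+?)\s*(?<=[\s*&])(\w+)$/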

# used to write header to h and cpp files
def write_header(file, iname, fname)
    file.puts "// wrapper for #{iname} in #{fname}"
    file.puts "// generated using wrapper_gen.rb"
    file.puts ""
end

# do stuff
f = IO.binread(file_name)
decl = interface_exp.match(f)

if(!decl)
    puts "Could not find declaration of interface #{interface_name} in #{file_name}"
    exit(0)
end

File.open("#{out_name}.h", "w+") { |hfile|
File.open("#{out_name}.cpp", "w+") { |cppfile|

write_header(hfile, interface_name, file_name)
hfile.puts "#include \"#{file_name}\""
hfile.puts ""
hfile.puts "interface hk#{interface_name} : public #{interface_name} {"
hfile.puts "    #{interface_name} *m_pWrapped;"
hfile.puts "    "
hfile.puts "public:"
hfile.puts "    hk#{interface_name}(#{interface_name} **pp#{interface_name});"
hfile.puts "    "
hfile.puts "    // original interface"

write_header(cppfile, interface_name, file_name)
cppfile.puts "#include \"#{out_name}.h\""
cppfile.puts ""
cppfile.puts "hk#{interface_name}::hk#{interface_name}(#{interface_name} **pp#{interface_name}) {"
cppfile.puts "    m_pWrapped = *pp#{interface_name};"
cppfile.puts "    *pp#{interface_name} = this;"
cppfile.puts "}"

decl[2].each_line do |line|
    m1 = method_exp1.match(line)
    m2 = method_exp2.match(line)
    if(m1 || m2)
        # header
        l = line.strip.gsub(/,(\S)/,', \1').gsub(" PURE","").gsub("PURE","")
        l = l.gsub("THIS_ ", "").gsub("THIS", "")
        hfile.puts "    #{l}"
        # parse
        name = m1 ? m1[1] : m2[2]
        rettype = m2 ? m2[1] : "HRESULT"
        param_string = m1 ? m1[2] : m2[3]
        params = param_string.split(",")
        params.reject! {|p| p.downcase == "this" }
        params.map! {|p| p.gsub("THIS_ ", "").strip }
        pnum = 0
        params_split = params.map do |p| 
            m = param_exp.match(p)
            if(m)
                [m[1], m[2]]
            else
                pnum += 1
                [p, "param_#{pnum}"]
            end
        end
        param_string = params_split.map { |pair| "#{pair[0]} #{pair[1]}" }.join(", ")
        # cpp
        cppfile.puts ""
        cppfile.puts "#{rettype} APIENTRY hk#{interface_name}::#{name}(#{param_string}) {"
        cppfile.puts "    #{log_string_pre}hk#{interface_name}::#{name}#{log_string_post}" if logging
        cppfile.puts "    return m_pWrapped->#{name}(#{params_split.map {|p| p[1]}.join(", ")});"
        cppfile.puts "}"
    end
end

hfile.puts "};"
hfile.puts ""

} # close cppfile
} # close hfile

To use it, you specify the interface name, input header file, output file base name, and optionally whether you want logging information to be generated for each wrapped method.

For example, ruby wrapper_gen.rb IDirect3DTexture9 d3d9.h d3d9tex true generates a wrapper for the IDirect3DTexture9 interface, reading its declaration from d3d9.h, and stores the generated wrapper in d3d9tex.h and d3d9tex.cpp. The method implementations in the latter include logging.

Here are the generated files for this test case.

d3d9tex.h:

// wrapper for IDirect3DTexture9 in d3d9.h
// generated using wrapper_gen.rb

#include "d3d9.h"

interface hkIDirect3DTexture9 : public IDirect3DTexture9 {
	IDirect3DTexture9 *m_pWrapped;

public:
	hkIDirect3DTexture9(IDirect3DTexture9 **ppIDirect3DTexture9);

	// original interface
	STDMETHOD(QueryInterface)(REFIID riid, void** ppvObj);
	STDMETHOD_(ULONG, AddRef)();
	STDMETHOD_(ULONG, Release)();
	STDMETHOD(GetDevice)(IDirect3DDevice9** ppDevice);
	STDMETHOD(SetPrivateData)(REFGUID refguid, CONST void* pData, DWORD SizeOfData, DWORD Flags);
	STDMETHOD(GetPrivateData)(REFGUID refguid, void* pData, DWORD* pSizeOfData);
	STDMETHOD(FreePrivateData)(REFGUID refguid);
	STDMETHOD_(DWORD, SetPriority)(DWORD PriorityNew);
	STDMETHOD_(DWORD, GetPriority)();
	STDMETHOD_(void, PreLoad)();
	STDMETHOD_(D3DRESOURCETYPE, GetType)();
	STDMETHOD_(DWORD, SetLOD)(DWORD LODNew);
	STDMETHOD_(DWORD, GetLOD)();
	STDMETHOD_(DWORD, GetLevelCount)();
	STDMETHOD(SetAutoGenFilterType)(D3DTEXTUREFILTERTYPE FilterType);
	STDMETHOD_(D3DTEXTUREFILTERTYPE, GetAutoGenFilterType)();
	STDMETHOD_(void, GenerateMipSubLevels)();
	STDMETHOD(GetLevelDesc)(UINT Level, D3DSURFACE_DESC *pDesc);
	STDMETHOD(GetSurfaceLevel)(UINT Level, IDirect3DSurface9** ppSurfaceLevel);
	STDMETHOD(LockRect)(UINT Level, D3DLOCKED_RECT* pLockedRect, CONST RECT* pRect, DWORD Flags);
	STDMETHOD(UnlockRect)(UINT Level);
	STDMETHOD(AddDirtyRect)(CONST RECT* pDirtyRect);
};

d3d9tex.cpp:

// wrapper for IDirect3DTexture9 in d3d9.h
// generated using wrapper_gen.rb

#include "d3d9tex.h"

hkIDirect3DTexture9::hkIDirect3DTexture9(IDirect3DTexture9 **ppIDirect3DTexture9) {
	m_pWrapped = *ppIDirect3DTexture9;
	*ppIDirect3DTexture9 = this;
}

HRESULT APIENTRY hkIDirect3DTexture9::QueryInterface(REFIID riid, void** ppvObj) {
	SDLOG(20, "hkIDirect3DTexture9::QueryInterface\n");
	return m_pWrapped->QueryInterface(riid, ppvObj);
}

ULONG APIENTRY hkIDirect3DTexture9::AddRef() {
	SDLOG(20, "hkIDirect3DTexture9::AddRef\n");
	return m_pWrapped->AddRef();
}

ULONG APIENTRY hkIDirect3DTexture9::Release() {
	SDLOG(20, "hkIDirect3DTexture9::Release\n");
	return m_pWrapped->Release();
}

HRESULT APIENTRY hkIDirect3DTexture9::GetDevice(IDirect3DDevice9** ppDevice) {
	SDLOG(20, "hkIDirect3DTexture9::GetDevice\n");
	return m_pWrapped->GetDevice(ppDevice);
}

HRESULT APIENTRY hkIDirect3DTexture9::SetPrivateData(REFGUID refguid, CONST void* pData, DWORD SizeOfData, DWORD Flags) {
	SDLOG(20, "hkIDirect3DTexture9::SetPrivateData\n");
	return m_pWrapped->SetPrivateData(refguid, pData, SizeOfData, Flags);
}

HRESULT APIENTRY hkIDirect3DTexture9::GetPrivateData(REFGUID refguid, void* pData, DWORD* pSizeOfData) {
	SDLOG(20, "hkIDirect3DTexture9::GetPrivateData\n");
	return m_pWrapped->GetPrivateData(refguid, pData, pSizeOfData);
}

HRESULT APIENTRY hkIDirect3DTexture9::FreePrivateData(REFGUID refguid) {
	SDLOG(20, "hkIDirect3DTexture9::FreePrivateData\n");
	return m_pWrapped->FreePrivateData(refguid);
}

DWORD APIENTRY hkIDirect3DTexture9::SetPriority(DWORD PriorityNew) {
	SDLOG(20, "hkIDirect3DTexture9::SetPriority\n");
	return m_pWrapped->SetPriority(PriorityNew);
}

DWORD APIENTRY hkIDirect3DTexture9::GetPriority() {
	SDLOG(20, "hkIDirect3DTexture9::GetPriority\n");
	return m_pWrapped->GetPriority();
}

void APIENTRY hkIDirect3DTexture9::PreLoad() {
	SDLOG(20, "hkIDirect3DTexture9::PreLoad\n");
	return m_pWrapped->PreLoad();
}

D3DRESOURCETYPE APIENTRY hkIDirect3DTexture9::GetType() {
	SDLOG(20, "hkIDirect3DTexture9::GetType\n");
	return m_pWrapped->GetType();
}

DWORD APIENTRY hkIDirect3DTexture9::SetLOD(DWORD LODNew) {
	SDLOG(20, "hkIDirect3DTexture9::SetLOD\n");
	return m_pWrapped->SetLOD(LODNew);
}

DWORD APIENTRY hkIDirect3DTexture9::GetLOD() {
	SDLOG(20, "hkIDirect3DTexture9::GetLOD\n");
	return m_pWrapped->GetLOD();
}

DWORD APIENTRY hkIDirect3DTexture9::GetLevelCount() {
	SDLOG(20, "hkIDirect3DTexture9::GetLevelCount\n");
	return m_pWrapped->GetLevelCount();
}

HRESULT APIENTRY hkIDirect3DTexture9::SetAutoGenFilterType(D3DTEXTUREFILTERTYPE FilterType) {
	SDLOG(20, "hkIDirect3DTexture9::SetAutoGenFilterType\n");
	return m_pWrapped->SetAutoGenFilterType(FilterType);
}

D3DTEXTUREFILTERTYPE APIENTRY hkIDirect3DTexture9::GetAutoGenFilterType() {
	SDLOG(20, "hkIDirect3DTexture9::GetAutoGenFilterType\n");
	return m_pWrapped->GetAutoGenFilterType();
}

void APIENTRY hkIDirect3DTexture9::GenerateMipSubLevels() {
	SDLOG(20, "hkIDirect3DTexture9::GenerateMipSubLevels\n");
	return m_pWrapped->GenerateMipSubLevels();
}

HRESULT APIENTRY hkIDirect3DTexture9::GetLevelDesc(UINT Level, D3DSURFACE_DESC *pDesc) {
	SDLOG(20, "hkIDirect3DTexture9::GetLevelDesc\n");
	return m_pWrapped->GetLevelDesc(Level, pDesc);
}

HRESULT APIENTRY hkIDirect3DTexture9::GetSurfaceLevel(UINT Level, IDirect3DSurface9** ppSurfaceLevel) {
	SDLOG(20, "hkIDirect3DTexture9::GetSurfaceLevel\n");
	return m_pWrapped->GetSurfaceLevel(Level, ppSurfaceLevel);
}

HRESULT APIENTRY hkIDirect3DTexture9::LockRect(UINT Level, D3DLOCKED_RECT* pLockedRect, CONST RECT* pRect, DWORD Flags) {
	SDLOG(20, "hkIDirect3DTexture9::LockRect\n");
	return m_pWrapped->LockRect(Level, pLockedRect, pRect, Flags);
}

HRESULT APIENTRY hkIDirect3DTexture9::UnlockRect(UINT Level) {
	SDLOG(20, "hkIDirect3DTexture9::UnlockRect\n");
	return m_pWrapped->UnlockRect(Level);
}

HRESULT APIENTRY hkIDirect3DTexture9::AddDirtyRect(CONST RECT* pDirtyRect) {
	SDLOG(20, "hkIDirect3DTexture9::AddDirtyRect\n");
	return m_pWrapped->AddDirtyRect(pDirtyRect);
}

You can adjust the code generated for the logging in the Ruby script. As you can see, this can save you a lot of rote work, particularly if you want to intercept multiple large interfaces.
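To make the pattern concrete without needing the DirectX SDK, here is a minimal sketch of what the generated wrappers boil down to. ITexture, RealTexture, hkTexture and the LOG macro are all hypothetical stand-ins. LOG plays the role of SDLOG, and its level semantics are an assumption. Every method logs its name and then forwards to m_pWrapped, so changing the logging only means changing the one line the script emits for it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for SDLOG(level, msg): messages at or below the
// current threshold are printed to stderr and also recorded for inspection.
static int g_log_threshold = 20;
static std::vector<std::string> g_log;
#define LOG(level, msg)                   \
    do {                                  \
        if ((level) <= g_log_threshold) { \
            g_log.push_back(msg);         \
            std::fputs((msg), stderr);    \
        }                                 \
    } while (0)

// Minimal interface standing in for a COM interface like IDirect3DTexture9.
struct ITexture {
    virtual ~ITexture() {}
    virtual unsigned long GetLevelCount() = 0;
    virtual unsigned long SetLOD(unsigned long LODNew) = 0;
};

// The "real" object being intercepted (behaviour simplified for the sketch:
// SetLOD just stores the new value and returns the previous one).
struct RealTexture : ITexture {
    unsigned long lod = 0;
    unsigned long GetLevelCount() override { return 5; }
    unsigned long SetLOD(unsigned long LODNew) override {
        unsigned long old = lod;
        lod = LODNew;
        return old;
    }
};

// Generated-style wrapper: log the call, then forward to the wrapped object.
struct hkTexture : ITexture {
    ITexture* m_pWrapped;
    explicit hkTexture(ITexture* pWrapped) : m_pWrapped(pWrapped) {}

    unsigned long GetLevelCount() override {
        LOG(20, "hkTexture::GetLevelCount\n");
        return m_pWrapped->GetLevelCount();
    }
    unsigned long SetLOD(unsigned long LODNew) override {
        LOG(20, "hkTexture::SetLOD\n");
        return m_pWrapped->SetLOD(LODNew);
    }
};
```

The application only ever sees an ITexture*, so handing it an hkTexture instead of the real object is enough to intercept every call.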

Update:

The original script didn’t deal with unnamed function parameters correctly. Now it should.

C++11 chrono timers

Posted on 2013-06-27 by petert

I’m a pretty big proponent of C++ as a language, and particularly enthused about C++11 and how it makes C++ even better. However, reality sadly still lags a bit behind the specification in many areas.

One thing that was always troublesome in C++, particularly in high-performance or realtime programming, was that there was no standard, platform-independent way of getting a high-performance timer. If you wanted cross-platform compatibility and a small timing period, you had to use some external library, use OpenMP (omp_get_wtime) or roll your own on each supported platform.

C++11 introduced the std::chrono namespace (header <chrono>). At least in theory, it provides everything you always wanted in terms of timing, right there in the standard library. Three types of clocks are offered for different use cases: system_clock, steady_clock and high_resolution_clock.
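For reference, measuring an interval with these clocks takes only a few lines. This is a sketch: time_ns is a hypothetical helper name. It uses steady_clock because that clock is guaranteed never to go backwards, which is what you want for intervals.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: run f once and return the elapsed time in nanoseconds.
// steady_clock is monotonic, so the result can never be negative.
template <typename F>
std::int64_t time_ns(F f) {
    using clk = std::chrono::steady_clock;
    const clk::time_point start = clk::now();
    f();
    const clk::time_point end = clk::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

// Clock properties are static members, so they can be checked at compile time.
static_assert(std::chrono::steady_clock::is_steady, "steady_clock is required to be steady");
```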

Yesterday I wrote a small program to query and test these clocks in practice on different platforms. Here are the results:

============================================
Linux, GCC 4.8.1
--------------------------------------------

Clock info for High Resolution Clock:
period: 1 ns
unit: 1 ns
Steady: false

Clock info for Steady Clock:
period: 1 ns
unit: 1 ns
Steady: true

Clock info for System Clock:
period: 1 ns
unit: 1 ns
Steady: false

Time/iter, no clock: 1 ns
Time/iter, clock: 120 ns
Min time delta: 110 ns

============================================
Windows, Visual Studio 2012
--------------------------------------------

Clock info for High Resolution Clock:
period: 100 ns
unit: 100 ns
Steady: false

Clock info for Steady Clock:
period: 100 ns
unit: 100 ns
Steady: true

Clock info for System Clock:
period: 100 ns
unit: 100 ns
Steady: false

Time/iter, no clock: 2 ns
Time/iter, clock: 9 ns
Min time delta: 1000000 ns

So, sadly, not everything is as great as it could be yet. For each platform, the first three blocks are the values reported by each clock, and the last block contains values determined by repeated measurements:

“period” is the tick period reported by each clock, in nanoseconds.
“unit” is the unit used by clock values, also in nanoseconds.
“steady” indicates whether the clock is steady: its values never decrease, and the time between ticks is always constant (for example, it is not adjusted when the system time changes).
“time/iter, no clock” is the time per loop iteration for the measurement loop without the actual measurement. It’s just a reference value to better judge the overhead of the clock measurements.
“time/iter, clock” is the average time per iteration, with clock measurement.
“min time delta” is the minimum difference between two consecutive, non-identical time measurements.
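The “min time delta” value can also be estimated without storing millions of samples, by spinning until now() returns a different value. This is an alternative technique to the unique/adjacent_difference approach in the program below, not the one used for the numbers above. observed_tick_ns is a hypothetical name, and the function assumes a steady clock, since a non-steady clock could jump backwards and produce a negative delta.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: smallest observed step between two distinct readings of
// Clock, in nanoseconds, over `samples` attempts. Assumes a steady clock.
// Clock values are quantized, so each distinct step is a whole number of ticks.
// Note: with a 1 ms tick this spins for roughly `samples` milliseconds.
template <typename Clock>
std::int64_t observed_tick_ns(int samples = 100) {
    std::int64_t best = -1;
    for (int s = 0; s < samples; ++s) {
        const typename Clock::time_point t0 = Clock::now();
        typename Clock::time_point t1 = Clock::now();
        while (t1 == t0) {
            t1 = Clock::now();  // busy-wait until the clock advances
        }
        const std::int64_t d =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (best < 0 || d < best) best = d;
    }
    return best;
}
```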

On Linux with GCC 4.8.1, all clocks report a tick period of 1 nanosecond. There isn’t really a reason to doubt that, and it’s obviously a great granularity. However, the drawback is that it takes around 120 nanoseconds on average to get a clock measurement. This would be understandable for the system clock, but seems excessive in the other cases, and could cause significant perturbation when trying to measure/instrument small code areas.

On Windows with VS2012, a clock period of 100 nanoseconds is reported, but the actual measured tick period is a whopping 1000000 ns (1 millisecond). That is obviously unusable for many of the use cases that would call for a “high resolution clock”. Windows is perfectly capable of supplying a true high-resolution clock measurement, so this performance (or lack of it) is quite surprising. On the bright side, a measurement takes just 9 nanoseconds on average.
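Until the library catches up, one workaround is to wrap the native high-resolution counter in a user-defined clock that satisfies the standard Clock requirements, so that it can be used anywhere a std::chrono clock is expected. This sketch uses QueryPerformanceCounter/QueryPerformanceFrequency on Windows and falls back to POSIX clock_gettime(CLOCK_MONOTONIC) elsewhere. The name qpc_clock is hypothetical, and its overhead is not part of the measurements above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ratio>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

// Hypothetical user-defined clock satisfying the standard Clock requirements
// (rep, period, duration, time_point, is_steady, now()), reporting nanoseconds.
struct qpc_clock {
    typedef std::int64_t rep;
    typedef std::nano period;
    typedef std::chrono::duration<rep, period> duration;
    typedef std::chrono::time_point<qpc_clock, duration> time_point;
    static const bool is_steady = true;

    static time_point now() {
#ifdef _WIN32
        // The performance counter frequency is fixed at boot, so query it once.
        static const std::int64_t freq = [] {
            LARGE_INTEGER f;
            QueryPerformanceFrequency(&f);
            return static_cast<std::int64_t>(f.QuadPart);
        }();
        LARGE_INTEGER c;
        QueryPerformanceCounter(&c);
        const std::int64_t ticks = c.QuadPart;
        // Convert ticks to ns in two parts so that ticks * 1e9 cannot overflow.
        const std::int64_t ns =
            (ticks / freq) * 1000000000 + (ticks % freq) * 1000000000 / freq;
        return time_point(duration(ns));
#else
        timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return time_point(duration(static_cast<std::int64_t>(ts.tv_sec) * 1000000000 + ts.tv_nsec));
#endif
    }
};
```

Because qpc_clock provides the same typedefs and static members as the standard clocks, an instance of it can be passed to print_clock_info in the program below without changes.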

Clearly, both implementations tested here still have a way to go. If you want to test your own platform(s), here is the very simple program:

#include <chrono>
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
using namespace std;

template<typename C>
void print_clock_info(const char* name, const C& c) {
	typename C::duration unit(1); 
	typedef typename C::period period;
	cout << "Clock info for " << name << ":\n"
		 << "period: " << period::num*1000000000ull / period::den << " ns \n"
		 << "unit: " << chrono::duration_cast<chrono::nanoseconds>(unit).count() << " ns \n"
		 << "Steady: " << (c.is_steady?"true":"false") << "\n\n";
}

int main(int argc, char** argv) {
	chrono::high_resolution_clock highc;
	chrono::steady_clock steadyc;
	chrono::system_clock sysc;

	print_clock_info("High Resolution Clock", highc);
	print_clock_info("Steady Clock", steadyc);
	print_clock_info("System Clock", sysc);

	const long long iters = 10000000;

	vector<long long> vec(iters); 
	auto ref_start = highc.now();
	for(int i=0; i<iters; ++i) {
		vec[i] = i;
	}
	cout << "Time/iter, no clock: " << chrono::duration_cast<chrono::nanoseconds>(highc.now()-ref_start).count()/iters << " ns\n";

	auto start = highc.now();
	for(int i=0; i<iters; ++i) {
		auto time = chrono::duration_cast<chrono::nanoseconds>(highc.now()-start).count();
		vec[i] = time;
	}
	cout << "Time/iter, clock: " << chrono::duration_cast<chrono::nanoseconds>(highc.now()-start).count()/iters << " ns\n";

	auto end = unique(vec.begin(), vec.end());
	adjacent_difference(vec.begin(), end, vec.begin());
	auto min = *min_element(vec.begin()+1, end);
	cout << "Min time delta: " << min << " ns\n";
}

Implementing your own synchronisation primitives is tricky

Posted on 2013-02-20 by petert

This blog post is about the folly of implementing your own synchronization primitives without thinking about what compilers are allowed to do. If you’re not into low-level C/x64 parallelism programming, you can safely skip it.

In the Insieme project, we use one double-ended work-stealing queue per hardware thread, which can be accessed (read and written) independently at both ends. It’s implemented as a circular buffer with a 64-bit control word.

// =========== Circular work buffers
// front = top, back = bottom
// top INCLUSIVE, bottom EXCLUSIVE
// Length needs to be a power of 2!
//
//  8 |       |
//    |-------|
//  7 |       |  <- top_update
//  6 |       |
//    |-------|
//  5 |#######|  <- top_val
//  4 |#######|
//  3 |#######|
//    |-------|
//  2 |       |  <- bot_val
//  1 |       |
//    |-------|
//  0 |       |  <- bot_update

typedef union _irt_cwb_state {
	uint64 all;
	struct {
		union {
			uint32 top;
			struct {
				uint16 top_val;
				uint16 top_update;
			};
		};
		union {
			uint32 bot;
			struct {
				uint16 bot_val;
				uint16 bot_update;
			};
		};
	};
} irt_cwb_state;

#define IRT_CWBUFFER_LENGTH 32
#define IRT_CWBUFFER_MASK (IRT_CWBUFFER_LENGTH-1)

typedef struct _irt_circular_work_buffer {
	volatile irt_cwb_state state;
	irt_work_item* items[IRT_CWBUFFER_LENGTH];
} irt_circular_work_buffer;
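Since the generated assembly reads these fields with shifts rather than through the union, it may help to see the same layout spelled out explicitly. This C++ sketch assumes the little-endian field order that the union produces on x86-64. cw_get, cw_set and next_index are hypothetical helpers, not part of Insieme. In C++, reading a union member other than the one last written is undefined behaviour, which is another reason to prefer shifts there. The sketch also shows why the length must be a power of 2: only then does masking with LENGTH-1 wrap the index like a modulo.

```cpp
#include <cassert>
#include <cstdint>

// Bit offsets of the four 16-bit fields in the 64-bit control word, matching
// the union layout on little-endian x86-64 (top_val lowest, bot_update highest).
enum CwField : unsigned { TOP_VAL = 0, TOP_UPDATE = 16, BOT_VAL = 32, BOT_UPDATE = 48 };

// Read one 16-bit field (gcc does the same with shr 16 / 32 / 48).
constexpr std::uint16_t cw_get(std::uint64_t state, CwField f) {
    return static_cast<std::uint16_t>(state >> f);
}

// Return a copy of state with one field replaced, leaving the others untouched.
constexpr std::uint64_t cw_set(std::uint64_t state, CwField f, std::uint16_t v) {
    return (state & ~(std::uint64_t{0xFFFF} << f)) | (std::uint64_t{v} << f);
}

// Only for a power-of-2 length does "& (length - 1)" behave like "% length".
constexpr unsigned kLength = 32;  // like IRT_CWBUFFER_LENGTH
static_assert((kLength & (kLength - 1)) == 0, "length must be a power of 2");

constexpr std::uint16_t next_index(std::uint16_t i) {
    return static_cast<std::uint16_t>((i + 1) & (kLength - 1));
}
```

As a cross-check, the mask cw_set uses for top_update, 0xFFFFFFFF0000FFFF, is exactly the constant -4294901761 that gcc loads into r8 in the listing below.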

The original code for adding a new item to this queue looked something like this:

void irt_cwb_push_front(irt_circular_work_buffer* wb, irt_work_item* wi) {
	// check feasibility
	irt_cwb_state state, newstate;
	for(;;) {
		state.all = wb->state.all;
		if(state.top_update != state.top_val) continue; // operation in progress on top
		// check for space
		newstate.all = state.all;
		newstate.top_update = (newstate.top_update+1) & IRT_CWBUFFER_MASK;
		if(newstate.top_update == state.bot_update 
			|| newstate.top_update == state.bot_val) continue; // not enough space in buffer, would be full after op
		// if we reach this point and no changes happened, we can perform our op
		if(irt_atomic_bool_compare_and_swap(&wb->state.all, state.all, newstate.all)) break; // repeat if state change since check
	}

	// write actual data to buffer
	wb->items[newstate.top_update] = wi;
	// finish operation
	wb->state.top_val = newstate.top_update;
}

Now, this generally worked fine in practice, but in unit tests roughly every 21 millionth insertion failed. After chasing a few wrong leads, I figured out that declaring newstate as volatile fixed the issue. The problem with this, of course, is that it makes no sense: newstate is a local variable stored on the stack of the executing thread, and it cannot be accessed by any other thread.

In the end, understanding the issue required looking at the generated assembly for both versions. Here’s what gcc generates for the non-volatile version:

.LVL402:
.LBB2802:
	.loc 15 80 0
	movabs	r8, -4294901761
	.p2align 4,,10
	.p2align 3
.L409:
	.loc 15 76 0
	mov	rax, QWORD PTR [rdi]
	.loc 15 77 0
	mov	rdx, rax                 ; copy state
	shr	rdx, 16                  ; move top_update right
	cmp	dx, ax                   ; compare top_update w/ top_val
	jne	.L409                    ; if not equal restart
	.loc 15 80 0
	add	edx, 1                   ; top_update+1
	mov	rcx, rax                 ; copy original state into rcx
	and	edx, 15                  ; & IRT_CWBUFFER_MASK
	and	rcx, r8                  ; delete bits 17 to 32 from original state in rcx
	movzx  r9d, dx                  ; r9d = lower 32 bits / zero extend top_update into 32 bits of r9
	sal	r9, 16                   ; move top_update back into position
	or     rcx, r9                  ; enter new top_update into original state
	.loc 15 81 0
	mov	r9, rax
	shr	r9, 48                   ; get bot_update in r9
	cmp	dx, r9w                  ; compare bot_update with newstate.top_update
	je     .L409                    ; no space, retry
	mov	r9, rax                  ; \
	shr	r9, 32                   ;  -> same story, bot_val
	cmp	dx, r9w                  ; /
	je	.L409 
	.loc 15 84 0
	lock cmpxchg	QWORD PTR [rdi], rcx
	jne	.L409
	.loc 15 88 0
	mov	rax, rdx
	.loc 15 90 0
	mov	WORD PTR [rdi], dx
	.loc 15 88 0
	and	eax, 15
	mov	QWORD PTR [rdi+8+rax*8], rsi
.LBE2802:
	.loc 15 91 0
	ret

And here’s the volatile one:

.L409:
.LBB2766:
	.loc 16 76 0
	mov	rax, QWORD PTR [rdi]
	.loc 16 77 0
	mov	rdx, rax
	shr	rdx, 16
	cmp	dx, ax
	jne	.L409
	.loc 16 79 0
	mov	QWORD PTR [rsp-16], rax
	.loc 16 80 0
	movzx	edx, WORD PTR [rsp-14]
	add	edx, 1
	and	edx, 15
	mov	WORD PTR [rsp-14], dx
	.loc 16 81 0
	movzx	ecx, WORD PTR [rsp-14]
	mov	rdx, rax
	shr	rdx, 48
	cmp	cx, dx
	je	.L409
	movzx	ecx, WORD PTR [rsp-14]
	mov	rdx, rax
	shr	rdx, 32
	cmp	cx, dx
	je	.L409
	.loc 16 84 0
	mov	rdx, QWORD PTR [rsp-16]
	lock cmpxchg	QWORD PTR [rdi], rdx
	jne	.L409
	.loc 16 88 0
	movzx	eax, WORD PTR [rsp-14]
	movzx	eax, ax
	mov	QWORD PTR [rdi+8+rax*8], rsi
	.loc 16 90 0
	movzx	eax, WORD PTR [rsp-14]
	mov	WORD PTR [rdi], ax
.LBE2766:
	.loc 16 91 0
	ret

As you can see from the comments in the first version, we started interpreting the assembly from the top. That was a mistake. If you look at the last few lines, you can see the culprit. The line mov QWORD PTR [rdi+8+rax*8], rsi corresponds to wb->items[newstate.top_update] = wi;. In the non-volatile version, gcc moves that store below the unlocking of the data structure (mov WORD PTR [rdi], dx, i.e. wb->state.top_val = newstate.top_update;). This is a perfectly valid transformation: the store to wb->items is not a volatile access, volatile only orders volatile accesses relative to each other, and gcc is obviously unaware of any parallelism going on, so as far as it can tell there is no dependency between the two lines.

There are many ways to fix the issue: add a memory barrier (__sync_synchronize in gcc), do the assignment using an atomic exchange operation, or, if you want to stay in pure C, write (wb->items[newstate.top_update] = wi) && (wb->state.top_val = newstate.top_update); which is admittedly ugly, and only works because wi is never NULL. Sadly, all of these options have a slight performance penalty. If anyone knows any other portable way to enforce the ordering of operations in this case, I’d be happy to hear about it.
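For comparison, here is how the same operation might look with C++11 atomics, which were designed to express exactly this kind of ordering portably. This is a hypothetical port, not the Insieme code. The control word becomes a std::atomic<std::uint64_t> with the same bit layout. C++ cannot portably do a 16-bit store into part of a 64-bit atomic, so finishing the operation is written as a release compare-and-swap loop on the whole word. The names Buffer, WorkItem, field and with_field are assumptions made for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace cwb {

constexpr unsigned kLength = 32;  // like IRT_CWBUFFER_LENGTH; must be a power of 2
constexpr unsigned kMask = kLength - 1;

// Bit offsets of the four 16-bit fields inside the 64-bit control word.
enum : unsigned { TOP_VAL = 0, TOP_UPDATE = 16, BOT_VAL = 32, BOT_UPDATE = 48 };

struct WorkItem { int id; };  // hypothetical stand-in for irt_work_item

inline std::uint16_t field(std::uint64_t s, unsigned shift) {
    return static_cast<std::uint16_t>(s >> shift);
}

inline std::uint64_t with_field(std::uint64_t s, unsigned shift, std::uint16_t v) {
    return (s & ~(std::uint64_t{0xFFFF} << shift)) | (std::uint64_t{v} << shift);
}

struct Buffer {
    std::atomic<std::uint64_t> state{0};
    WorkItem* items[kLength] = {};
};

inline void push_front(Buffer* wb, WorkItem* wi) {
    std::uint64_t state = 0;
    std::uint16_t slot = 0;
    for (;;) {
        state = wb->state.load(std::memory_order_acquire);
        if (field(state, TOP_UPDATE) != field(state, TOP_VAL)) continue;  // operation in progress on top
        slot = static_cast<std::uint16_t>((field(state, TOP_UPDATE) + 1) & kMask);
        if (slot == field(state, BOT_UPDATE) || slot == field(state, BOT_VAL)) continue;  // would be full
        // Reserve the slot; fails (and we retry) if the state changed since the load.
        if (wb->state.compare_exchange_weak(state, with_field(state, TOP_UPDATE, slot),
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed))
            break;
    }

    wb->items[slot] = wi;  // write actual data to buffer

    // Finish the operation: set top_val = slot with a release CAS. Neither the
    // compiler nor the CPU may move the store to items[slot] after a release
    // operation, so a thread that acquire-loads the new state also sees the item.
    std::uint64_t cur = wb->state.load(std::memory_order_relaxed);
    while (!wb->state.compare_exchange_weak(cur, with_field(cur, TOP_VAL, slot),
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        // cur now holds the fresh value (e.g. a concurrent pop changed bot_*); retry
    }
}

}  // namespace cwb
```

On x86-64, release ordering on its own costs nothing, since ordinary stores already have release semantics there; the cost in this sketch comes from the locked cmpxchg in the finishing loop. C11 offers the same operations in <stdatomic.h>, e.g. atomic_compare_exchange_weak_explicit with memory_order_release.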

And that’s it, more or less. Lessons learned: take care when implementing your own synchronization primitives. If you think you are taking care, take more care. And when comparing assembly, look at the obvious differences before starting to interpret the code top down.