Wrapper_gen, a wrapper generator for COM interfaces

DSfix was based on a Direct3D9 wrapper, which was mostly taken from an existing code base and extended manually.

Recently, I needed to hook Direct3D9Ex, and came to the conclusion that the manual busywork of writing the initial wrapper is better left to a computer than a human. Therefore, I wrote a Ruby script which takes a COM interface specification from a Microsoft DLL header and generates the C++ code for a wrapper class for it.

Here’s the script (wrapper_gen.rb); it’s rather tiny:

To use it, you specify the interface name, input header file, output file base name, and optionally whether you want logging information to be generated for each wrapped method.

For example, ruby wrapper_gen.rb IDirect3DTexture9 d3d9.h d3d9tex true would generate a wrapper for the IDirect3DTexture9 interface, take the required information from d3d9.h, and store the generated wrapper in d3d9tex.h and d3d9tex.cpp. The implementations in the latter would include logging.

Here are the generated files for this test case.

d3d9tex.h:
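(A trimmed sketch of the kind of header the generator produces – the class name hkIDirect3DTexture9 is illustrative, not necessarily the script’s exact output.)

```cpp
// d3d9tex.h (abridged sketch) -- the real file declares a wrapper
// for every method of IDirect3DTexture9
#pragma once
#include <d3d9.h>

class hkIDirect3DTexture9 : public IDirect3DTexture9 {
    IDirect3DTexture9* pWrapped; // the real interface; all calls forward to it

public:
    hkIDirect3DTexture9(IDirect3DTexture9* pReal) : pWrapped(pReal) {}

    // IUnknown
    HRESULT APIENTRY QueryInterface(REFIID riid, void** ppvObj);
    ULONG APIENTRY AddRef();
    ULONG APIENTRY Release();

    // IDirect3DTexture9 (excerpt)
    HRESULT APIENTRY LockRect(UINT Level, D3DLOCKED_RECT* pLockedRect, CONST RECT* pRect, DWORD Flags);
    HRESULT APIENTRY UnlockRect(UINT Level);
    // ... one declaration per remaining interface method ...
};
```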

d3d9tex.cpp:
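(Again a sketch – every generated method follows this log-and-forward pattern; SDLOG stands in for whatever logging facility the generated code actually uses.)

```cpp
// d3d9tex.cpp (abridged sketch) -- each method logs its arguments,
// then forwards to the wrapped object
#include "d3d9tex.h"
#include <cstdio>

#define SDLOG(...) printf(__VA_ARGS__) // stand-in logging macro

HRESULT APIENTRY hkIDirect3DTexture9::LockRect(UINT Level, D3DLOCKED_RECT* pLockedRect, CONST RECT* pRect, DWORD Flags) {
    SDLOG("hkIDirect3DTexture9::LockRect(%u, %p, %p, %lu)\n", Level, (void*)pLockedRect, (void*)pRect, Flags);
    return pWrapped->LockRect(Level, pLockedRect, pRect, Flags);
}

HRESULT APIENTRY hkIDirect3DTexture9::UnlockRect(UINT Level) {
    SDLOG("hkIDirect3DTexture9::UnlockRect(%u)\n", Level);
    return pWrapped->UnlockRect(Level);
}
```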

You can adjust the code generated for the logging in the Ruby script. As you can see, this can save you a lot of rote work, particularly if you want to intercept multiple large interfaces.

Update:

The original script didn’t deal with unnamed function parameters correctly. Now it should.

 

PtBi update and source release

I just released a new version of PtBi (5.1729). It’s a minor update that adds a few small features people were asking for:

  • A nearest neighbour scaling mode.
  • The ability to bind keys to switch directly to a given AA or scaling mode (instead of going through the available modes step by step). See keys.ini for details and some examples.

More importantly, I also uploaded an initial commit of the PtBi source to GitHub. It’s probably a bit hard to get to build initially due to the dependencies, but I hope it is useful for someone.

 

C++11 chrono timers

I’m a pretty big proponent of C++ as a language, and particularly enthused about C++11 and how that makes it even better. However, sadly reality still lags a bit behind specification in many areas.

One thing that was always troublesome in C++, particularly in high-performance or realtime programming, was that there was no standard, platform-independent way of getting a high-resolution timer. If you wanted cross-platform compatibility and a small timing period, you had to go with an external library, use OpenMP, or roll your own on each supported platform.
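For reference, “rolling your own” typically meant something like this minimal sketch, with one code path per supported platform:

```cpp
// Pre-C++11 high-resolution timing: a separate implementation per platform.
#ifdef _WIN32
#include <windows.h>
double getTimeSeconds() {
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq); // ticks per second
    QueryPerformanceCounter(&now);
    return static_cast<double>(now.QuadPart) / freq.QuadPart;
}
#else
#include <time.h>
double getTimeSeconds() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); // POSIX monotonic clock
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}
#endif
```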

In C++11, the chrono namespace was introduced. At least in theory, it provides everything you always wanted in terms of timing, right there in the standard library. Three different types of clocks are offered for different use cases: system_clock, steady_clock and high_resolution_clock.

Yesterday I wrote a small program to query and test these clocks in practice on different platforms. Here are the results:

So, sadly, everything is not yet as great as it could be. For each platform, the first three blocks are the values reported by each clock, and the last block contains values determined by repeated measurements:

  • “period” is the tick period reported by each clock, in nanoseconds.
  • “unit” is the unit used by clock values, also in nanoseconds.
  • “steady” indicates whether the time between ticks is always constant for the given clock.
  • “time/iter, no clock” is the time per loop iteration for the measurement loop without the actual measurement. It’s just a reference value to better judge the overhead of the clock measurements.
  • “time/iter, clock” is the average time per iteration, with clock measurement.
  • “min time delta” is the minimum difference between two consecutive, non-identical time measurements.

On Linux with GCC 4.8.1, all clocks report a tick period of 1 nanosecond. There isn’t really a reason to doubt that, and it’s obviously a great granularity. However, the drawback is that it takes around 120 nanoseconds on average to get a clock measurement. This would be understandable for the system clock, but seems excessive in the other cases, and could cause significant perturbation when trying to measure/instrument small code areas.

On Windows with VS12, a clock period of 100 nanoseconds is reported, but the actual measured tick period is a whopping 1000000 ns (1 millisecond). That is obviously unusable for many of the use cases that would call for a “high resolution clock”. Windows is perfectly capable of supplying true high-resolution time measurements, so this performance (or lack thereof) is quite surprising. On the bright side, a measurement takes just 9 nanoseconds on average.

Clearly, both implementations tested here still have a way to go. If you want to test your own platform(s), here is the very simple program:
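(The listing below is a reconstruction of the measurement logic, not the verbatim original; the output format differs slightly.)

```cpp
#include <chrono>
#include <cstdio>

// For each clock: print its reported traits, then measure the average cost
// of a clock read and the minimum nonzero delta between consecutive readings.
template<typename Clock>
void testClock(const char* name) {
    using namespace std::chrono;
    printf("%s\n  period: %.2f ns\n  steady: %d\n", name,
           1e9 * Clock::period::num / Clock::period::den, (int)Clock::is_steady);

    const int N = 1000000;
    typename Clock::duration minDelta = Clock::duration::max();
    auto start = Clock::now();
    auto prev = start;
    for(int i = 0; i < N; ++i) {
        auto t = Clock::now();
        if(t != prev && t - prev < minDelta) minDelta = t - prev;
        prev = t;
    }
    auto total = Clock::now() - start;
    printf("  time/iter, clock: %.2f ns\n  min time delta: %.2f ns\n",
           duration<double, std::nano>(total).count() / N,
           duration<double, std::nano>(minDelta).count());
}

int main() {
    testClock<std::chrono::system_clock>("system_clock");
    testClock<std::chrono::steady_clock>("steady_clock");
    testClock<std::chrono::high_resolution_clock>("high_resolution_clock");
}
```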

 

PtBi version 5

I just released a new major version of PtBi, with 2 new features.

Dolby Digital 5.1 decoding

PtBi can now decode audio streams transmitted in Dolby Digital 5.1 format. Together with the existing DTS 5.1 decoding, this should now allow for true surround sound from almost any source. I believe that PtBi is the only Blackmagic Intensity capture program with this type of audio support.
This was easier than I expected at first, because the decoding library functions very similarly to the one I used for DTS, but then I was stuck for hours without any progress. It turns out that someone thought it would be a good idea to standardize a bitstream format such that it can be either big-endian or little-endian. Ugh.
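For the curious: AC-3 frames start with the 16-bit sync word 0x0B77, so the byte order can be detected by peeking at the first bytes of a frame. A hypothetical check along these lines (not PtBi’s actual code):

```cpp
#include <cstdint>
#include <cstddef>

// An AC-3 frame begins with the sync word 0x0B77. In a byte-swapped stream
// the same word arrives as 0x77 0x0B, so the first two bytes tell us whether
// every 16-bit word needs swapping before decoding.
bool isByteSwappedAC3(const uint8_t* frame, size_t len) {
    return len >= 2 && frame[0] == 0x77 && frame[1] == 0x0B;
}
```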

SMAA integration

In addition to the existing FXAA, PXAA and TPXAA post-processing AA modes, PtBi now also supports SMAA1x. SMAA1x has slightly better edge quality and motion stability than FXAA. I’ll look into integrating SMAA with my predication filters at some point in the future.

 

Also, I plan to release the source code for PtBi soon-ish. I was always reluctant to do this, since some of it is based on code I wrote almost a decade ago which is pretty terrible, but I cleaned it up slightly now. And some parts of it, like how to integrate the AA modes in OpenGL or how to use the various libraries for audio decoding/playback might be useful to someone. Also, it could help people identify and solve problems with AMD cards, which are always very hard for me to test/debug without access to the hardware.

Texture Scaling in Emulators

PPSSPP is a great PSP emulator for all kinds of platforms, including Windows and Android. I recently started using it to play some of my PSP games, and I was surprised how nice a few of them (particularly the stylized ones) can look with some AA and a higher rendering resolution.

However, the texture resolution in many of the games is a huge blemish on the visuals. Look at this example (from Fate/Extra):

Default scaling

Particularly the hair is absurdly pixelized, but the clothing and tree textures aren’t much better. In general, trying to make a higher-resolution image from a lower-resolution one is a fool’s errand, as the information just isn’t there. However, for stylized textures such as these I thought something might be done.

The first idea was to use HQ4x, an image scaling algorithm designed for pixel art. Hacking that into PPSSPP yielded the following result:
HQ4x

As you can see, it was pretty effective on the hard transparency edges of the hair and tree textures, but only increased the pixelation on the soft, anti-aliased edges of the cloth.

Luckily, the scaling of image art has advanced quite a bit since HQx was created, and I soon found an algorithm called xBR, created by Hyllian on the byuu.org message boards. The source code for xBRZ, a slightly improved and parallelizable implementation of xBR, is available as part of the HqMAME project. It deals much better with anti-aliased edges, and integrating it into PPSSPP ended up looking like this:

xBRZ

It’s a generally great result, and better than HQ4x, with one drawback: the posterization of gradients. It’s not too apparent in the image above, but it can be very distracting in other scenes and games (e.g. it can look really bad in sky textures).

To circumvent that effect I had to take to Matlab. I came up with an algorithm that calculates a mask based on the local contrast of a texture, and then chooses between xBRZ and bilinear/bicubic texture scaling based on the mask value.
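In rough C++ terms, the per-pixel blend works like this (a sketch of the idea, not the code that shipped in PPSSPP):

```cpp
#include <cstdint>
#include <vector>

// Blend the xBRZ and bicubic scaling results per pixel, steered by a mask
// derived from the local contrast of the source texture: high-contrast
// (stylized edge) regions get xBRZ, smooth gradients get bicubic, which
// avoids xBRZ's posterization.
void hybridScale(const std::vector<uint32_t>& xbrz,
                 const std::vector<uint32_t>& bicubic,
                 const std::vector<float>& mask, // 1.0 = edge, 0.0 = smooth
                 std::vector<uint32_t>& out) {
    for(size_t i = 0; i < out.size(); ++i) {
        uint32_t result = 0;
        for(int shift = 0; shift <= 24; shift += 8) { // blend each 8-bit channel
            float a = (xbrz[i]    >> shift) & 0xFF;
            float b = (bicubic[i] >> shift) & 0xFF;
            result |= (uint32_t)(a * mask[i] + b * (1.0f - mask[i]) + 0.5f) << shift;
        }
        out[i] = result;
    }
}
```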
Contrast mask

Putting all of that together, and adding an additional deposterization step which improves the quality of compressed textures, I arrived at this:
Hybrid result

The initial version was very slow, particularly with bicubic scaling. So I also parallelized everything and added an SSE 4.1 version of the scaling function. You can try the final result in any recent build of PPSSPP.

There are still many things that could be explored for even better automatic texture scaling in emulators. One particular deficiency of xBR for texture scaling is how it deals with the borders of images. It simply assumes that the texture continues as on the border (i.e. replicates it). A better idea for textures could be to assume that the edge direction continues as it does on the border – this could reduce some tiling artifacts that appear when scaling.

Another interesting topic would be the replication of noise or small-scale detail on an upscaled texture, but it would require some in-depth analysis of the texture images which might not be feasible in real-time.

Oculus Rift

Two days ago I received my Oculus Rift developer kit. If you’re unfamiliar with the Rift, it’s an affordable Virtual Reality headset that had a successful kickstarter for developer kits last year.

My kit had a pretty long journey, going to Australia first. I used to think that people (particularly in the US) mixing up Austria and Australia was just a myth, but it seems like it actually happens:

Mislabeled Package

Tracking Information for the UPS order

But hey, all is well that ends well. It’s a really nicely packaged kit, and includes adapters for anywhere on earth and 3 times as many video cables as you need:

Box

You can find much better pictures of exactly what’s inside (and the great box!) elsewhere on the web.

Sadly, I don’t have much time to do development for the Rift or even much testing right now, but here are my first impressions:

  • It works! When you first put it on and look around, it really feels like an entirely new experience. I had a few people at work try it today, and all were really impressed as well.
  • The resolution is low, but not as bad as I expected. I think with the consumer version’s planned 1080p resolution and really nicely anti-aliased rendering, we’ll be fine for a while.
  • The pixel switching time of the current display is too long. Ideally, I think it should use something like an OLED display, with instant response.
  • The headtracking is really fast; I didn’t notice any perceptible delay.

I just tested using the “Oculus World Demo” included with the SDK, and I noticed that the reaction speed and even the blur with head movement seemed significantly better in windowed fullscreen mode than in “real” fullscreen mode. I’m not sure why this is the case; it could be that I had VSync on in real fullscreen.

Anyway, I hope I get more time to play around with it this weekend.

 

Implementing your own synchronisation primitives is tricky

This blog post is about the folly of implementing your own synchronization primitives without thinking about what compilers are allowed to do. If you’re not into low-level C/x64 parallelism programming then you can safely skip it ;)

In the Insieme project, we use one double-ended work stealing queue per hardware thread which can be independently accessed (read and write) at both ends. It’s implemented as a circular buffer with a 64 bit control word.

The original code for adding a new item to this queue looked something like this:
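(The listing below is a reconstruction based on the snippets discussed further down; the field names top_val and top_update come from those snippets, while the exact layout, WB_SIZE and the locking details are my guesses.)

```cpp
#include <stdint.h>

#define WB_SIZE 64
typedef struct work_item work_item;

// 64 bit control word: packs the queue indices together with
// "update in progress" markers, so the whole state can be swapped atomically.
typedef union {
    uint64_t all;
    struct {
        uint16_t top_val;     // index of the current top element
        uint16_t top_update;  // slot a push in progress is writing to
        uint16_t bottom_val;
        uint16_t bottom_update;
    };
} wb_state;

typedef struct {
    wb_state state;
    work_item* items[WB_SIZE];
} work_buffer;

// Push wi at the top end of the circular buffer.
int wb_push(work_buffer* wb, work_item* wi) {
    wb_state state, newstate;
    state.all = wb->state.all;
    newstate.all = state.all;
    newstate.top_update = (newstate.top_val + 1) % WB_SIZE; // claim the next slot
    // "lock" the top end by atomically publishing top_update
    if(!__sync_bool_compare_and_swap(&wb->state.all, state.all, newstate.all))
        return 0; // contention, let the caller retry
    wb->items[newstate.top_update] = wi;      // write the item ...
    wb->state.top_val = newstate.top_update;  // ... then unlock by committing it
    return 1;
}
```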

Now, this generally worked fine in practice, but in unit tests roughly every 21 millionth insertion failed. After chasing a few wrong leads, I figured out that declaring newstate as volatile fixed the issue. The problem with this, of course, is that it makes no sense: newstate is a local variable stored on the stack of the executing thread – it cannot be accessed by any other thread.

In the end, understanding the issue required looking at the assembly generated for both versions. Here’s what gcc does in the non-volatile version:

And here’s the volatile one:

As you can see from the comments in the first version, we started interpreting the assembly from the top. That was a mistake. If you look at the last few lines, you can see the culprit. The line mov QWORD PTR [rdi+8+rax*8], rsi corresponds to wb->items[newstate.top_update] = wi;. In the non-volatile version, gcc decides to move that line below the unlocking of the data structure. This is a perfectly valid transformation, since there are no dependencies between the two lines (gcc is obviously unaware of any parallelism going on).

There are many ways to fix the issue: add a memory barrier (__sync_synchronize in gcc), do the assignment using an atomic exchange operation, or, if you want to stay in pure C, write (wb->items[newstate.top_update] = wi) && (wb->state.top_val = newstate.top_update);, which is admittedly ugly and only works because wi is never NULL. Sadly, all of these options carry a slight performance penalty. If anyone knows another portable way to enforce the ordering of operations in this case, I’d be happy to hear about it.
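Concretely, the barrier variant just pins the order of the two stores (following the reconstruction above):

```cpp
wb->items[newstate.top_update] = wi;      // write the item
__sync_synchronize();                     // full barrier: the store above can no
                                          // longer sink below the unlock
wb->state.top_val = newstate.top_update;  // unlock
```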

And that’s it, more or less. Lessons learned: take care when implementing your own synchronization primitives. If you think you are taking care, take more care. And when comparing assembly, look at the obvious differences before starting to interpret the code top-down.

PtBi 4.1516

I just fixed the crash bug in PtBi introduced with the latest NVidia WHQL drivers.

If anyone from NV is reading this, I really don’t think having a:

should cause the shader compiler to spit out this:

It works just fine without the “restrict”.

Anyway, if you’re using PtBi with a NV GPU then you can find an updated, working version on the PtBi homepage. Sorry for the delay in fixing this.

DSfix 2.0.1

With yesterday’s 2.0 release I introduced an issue with the HUD modifications. It’s fixed now. That’s all that has changed.

People are also reporting some stability problems and physics issues since the patch, but I’m not sure those are related to DSfix. On the bright side, it seems like in addition to fixing the stereo downmix, the patch also somewhat reduced the CPU load of the game.

As always, consider donating if you like the mod.

Get DSfix 2.0.1 here.

Edit: Mediafire decided to take the file down for some reason, here is a mirror.

You can also always get DSfix at the Dark Souls Nexus.

DSfix 2.0

Dark Souls was updated today, fixing the audio downmixing bug that had been present since launch (and maybe more?). Unfortunately, it also broke some features of DSfix, most significantly the FPS unlocking.

Well, with a lot of help from Clément Barnier, here is version 2.0 of DSfix which resolves these issues and adds a small new feature.

Changes:

  • Updated the framerate unlock feature to work with the patched version of the game (Nwks)
  • Updated post-processing AA to work with the patched version of the game
  • Fixed an issue where hudless screenshots would sometimes not correctly capture some effects
  • Added “presentWidth” and “presentHeight” to the .ini for full control over (windowed) downsampling. For example, if you want to downsample from 2560×1440 to 1080p, you would use renderWidth 2560, renderHeight 1440, presentWidth 1920 and presentHeight 1080. If none of that makes sense to you just leave these values at 0 ;)
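In .ini terms, that example looks like this (assuming the same key value syntax as the other settings in DSfix.ini):

```ini
renderWidth 2560
renderHeight 1440
presentWidth 1920
presentHeight 1080
```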

I hope this allows you to enjoy Dark Souls in its full glory again. Happy holidays!

As always, consider donating if you like the mod.

Get DSfix 2.0 here.

It’s 4 am here now so if I messed up anything in this release it will have to wait until tomorrow.