- The tracing intermediate representation now has operations for SIMD instructions (named vec_XYZ) and vector variables
- The proposed algorithm was implemented in the optimization backend of the JIT compiler
- Implemented guard strength reduction for guards with arithmetic arguments
- Routines to build a dependency graph and reschedule the trace
- Extended the backend to emit SSE4.1 SIMD instructions for the new tracing operations
- Ran some benchmark programs and evaluated the current gain
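To illustrate the dependency-graph and rescheduling step mentioned above, here is a deliberately simplified sketch (not PyPy's actual implementation; the trace format and helper names are invented for illustration): each trace operation depends on the operations that produced its arguments, and any topological order of that graph is a legal reschedule.

```python
# Hypothetical sketch of dependency-graph construction and rescheduling.
# A trace is modeled as a list of (result, opname, args) tuples.
from collections import defaultdict

def build_dependencies(trace):
    """Return a mapping: op index -> set of op indices it depends on."""
    producers = {}           # variable name -> index of the op producing it
    deps = defaultdict(set)
    for i, (result, _op, args) in enumerate(trace):
        for a in args:
            if a in producers:
                deps[i].add(producers[a])
        producers[result] = i
    return deps

def reschedule(trace, deps):
    """Emit ops in a topological order (simple Kahn-style scheduling)."""
    n = len(trace)
    scheduled, done = [], set()
    while len(scheduled) < n:
        for i in range(n):
            if i not in done and deps[i] <= done:
                scheduled.append(trace[i])
                done.add(i)
    return scheduled

# A tiny trace: load two elements, multiply them, store the result.
trace = [
    ("v0", "load",  ["a", "i"]),
    ("v1", "load",  ["b", "i"]),
    ("v2", "mul",   ["v0", "v1"]),
    ("_",  "store", ["a", "i", "v2"]),
]
deps = build_dependencies(trace)
print([op for (_, op, _) in reschedule(trace, deps)])
```

In the real optimizer the point of rescheduling is to move independent loads and arithmetic next to each other so they can be fused into the vec_XYZ operations.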

*numpy.sum* or *numpy.prod* can be executed with SIMD instructions.

Here is a preview of the trace loop speedups the optimization currently achieves.

Note that the setup for all programs is the following: create two vectors of 10,000 elements each (or one vector for the last three programs) and execute the operation (e.g. multiply) on the specified datatype. The code looks similar to:

```python
a = np.zeros(10000, dtype='float')
b = np.ones(10000, dtype='float')
np.multiply(a, b, out=a)
```

After about 1000 iterations of the multiplication, the tracing JIT records and optimizes the trace. The time is recorded just before jumping to the trace and just after exiting it; the difference is what you see in the plot above. Note that there is still a problem with any/all, and that this is only a micro benchmark: it does not necessarily tell anything about the whole runtime of a program.
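A minimal version of such a micro benchmark could look like the following (the exact harness used for the plot is not shown here; the warm-up count of 1000 follows the description above):

```python
import time
import numpy as np

a = np.zeros(10000, dtype='float64')
b = np.ones(10000, dtype='float64')

# Warm-up: give the tracing JIT enough iterations to record and
# optimize the trace before we start measuring.
for _ in range(1000):
    np.multiply(a, b, out=a)

start = time.time()
for _ in range(1000):                  # measured iterations
    np.multiply(a, b, out=a)
elapsed = time.time() - start
print("per-call seconds:", elapsed / 1000)
```

As the text notes, a result like this says little about whole-program runtime; it only isolates the cost of the traced loop itself.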

For multiply-float64 the theoretical maximum speedup is nearly achieved! (With 128-bit SSE registers, two float64 values fit into one register, so the best possible speedup is 2x.)

## Expectations for the next two months

One thing I'm looking forward to is the Python conference in Bilbao. I have not yet met my mentors and the other developers in person. This will be awesome!

I have also been promised a review of my work, so that I can further improve the optimization.

To get even better results I will also need to restructure some parts of the Micro-NumPy library in PyPy.

I think I'm quite close to finishing the implementation (since I already started in February), and I expect to spend the rest of the GSoC program extending, testing, polishing, restructuring and benchmarking the optimization.
