First look at automatic XF2Python

In my previous blog I had a first look on what Excel Formula, or Excel Macro 4.0 – would look like in Python 3.x. I manually wrote support for the first layer of that sample. Little did I know how complicated it would become to do this automatic, and what secrets the sample had further down.

To do this research I wrote a tiny debugger-interface inside xf2python; just to allow access to the world as it got a bit complicated.

When you first start (same sample as previous blog) you get this:

It first reads the entire workbook stream and parses all the data. It then goes through all macro-sheets, parses the formula expressions and shows you the textual representation of them. It now wants to start to convert these to Python 3.x code:

xf2python then uses the Python AST library to compile and run this Python code within the application (recursively). This caused me a lot of pain trying to find out how to do this in a clean way. I almost went for creating my own VM, but then it wouldn’t be xf2python and I am quite stubborn. That said, it might be that a custom VM running the XF AST is a better approach. Problem there is that the decoded Formula statements deliver text, so another challenge is then added to translate the textual representation back to AST.

After the run is complete it accesses the locals of the compiled application to find the table of “FORMULA” entries that the runtime added to a list in side the “” “FORMULA” API.

Now it starts to convert this code to Python 3.x. Mind you, there is no real “formula parsed expressions/AST” for this as they are just strings – which is why I do the conversion from XF/XML to Python always on the textual representation.

The probably hardest conversions are lines that contains many expressions, like:

formula R[35]C[0] =SET.VALUE(R21C4,202)
formula R[36]C[0] =SET.VALUE(R21C4,IF(SIN(LEN(GET.WORKSPACE(1)))<2,IF(RESET.TOOLBAR(1),1,100),100))

As you can see, it sets R21C4 to 202, and then it sets the same value to a “IF(SIN(LEN(GET.WORKSPACE(1)))<2,IF(RESET.TOOLBAR(1),1,100),100)”, which has another IF/ELSE embedded. You can see from the Python conversion what the equal line would be:

Also, as this is the second layer, the encrypted raw data isn’t already stored in cells, but will be using SET.VALUE (which I omitted here for length purposes). This causes some issues as a local variable, e.g R23C4 isn’t really reflected in the INDIRECT API getting the cell R23C4.

The tool now runs this code via Python AST and extracts the local variables to access the results of the FORMULA statements:

And then we get the final layer of the code that actually runs within the spreadsheet. First an obfuscation layer, then an encryption layer – and finally the raw code – creating a javascript to be loaded by explorer and download the jc.cpl, which will also be executed by explorer.

Next is to convert this to Python code and run it – to extract the behavior, which won’t be hard.

Here is a video where you see the interactions with the sample through the debugger-interface while development is being done:

All this can be automated for sure, but there is still a long way to go. Many problems I encountered needs better fixes, so now when I know what is required it’s time to let that sink in and prepare for a rewrite.

And the best part, done in milliseconds…

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: