The API side

The API is the low level core of an engine because this where are implemented matrix and the matrix manager. We are calling it “low level” because it never use Django framework and so can be used without a webapp project, database and Django settings.

Workbooks are based on Excel files

Historically, France Active was managing workbooks from Excel files that was given to the structures to fill their financial datas. Exchanging Excel files between France Active and structures leaded to management problems. This is why they intended to make a web plateform like Inserdiag.

Workbook’s Excel files have been developped with multiple sheets for some kind of financial activities and balance sheets. Each sheet contains many data cells, some of them are open to data input for structures and some others are formula cells. The Excel files are protected and so structures cannot modify them, only fill data in allowed cells.

There are multiple workbook variants and each of them has its own dedicated Excel file. In fact, currently there is 3 variants for Inserdiag and Culturdiag workbooks : the first for 3 years activities, a second for 5 years activities and the last for Multiple activities (on 3years). Note that Inserdiag and Culturdiag do not share any workbook, all of them have their own base (that is the one on 3years activities).

Workbook formats in details

Even we are talking about Excel files (XSL file), internally this is not really true. When the project has been started, there was no Python library that can fully acceed to all needed datas from a common Excel file (like formulas and cell style), so it has been decided to use OpenOffice Flat XML format that was an XML format that can be fully parsed with any XML parser.

The method is somewhat tricky because it involved to open the original Excel file with OpenOffice then save it as a Flat XML file. Problem is since OpenOffice 3.2 has changed this format and its successor LibreOffice has inherited from it, so the parser from the API is not up to date with it and that leads to the fact that we can’t actually update Workbook matrix from the sources. The parser need to be updated and it is too much work for now.

Note

For Autodiag2 we have an opportunity with openpyxl that seems fully usable to extract all datas from an XSLX file (that is an OpenDocument format) but we will need to rewrite the parser for this.

Each Workbook sources are actually stored for three formats :

  • XLS: Common Excel file, that is the source used to create Flat XML version and used as a template to export Structure’s Workbooks from the web interface. Provided by France Active;
  • XLSX: OpenDocument (XML) Excel file, only for some API tools to inspect some datas but not really used. Provided by France Active;
  • FODS: Flat OpenOffice XML file, used to inspect and extract a workbook to their Python matrix. Generated from the common Excel files;

Even we mainly use the FODS files and XLS files, we still maintains all versions to have fully synchronized sources.

Engine API

The main API goal is to inspect, parse and extract Excel file to a Python matrix, and also to implement base methods to calculate formulas using the matrix. A workbook matrix contains cell coordinates, their kinds, their default datas and eventually their formulas. And all of this is used to perform calculation between cells because like within an Excel file, cells formulas can adress to other cells from other sheets and follow many cells through many sheets to do their calculations.

We assume two virtual kinds of cell to extract: data cell and formula cell. They are virtual because for Excel a cell is a formula cell AND a data cell. Even the engine implement also this behavior, we need to distinct these kinds to build separated matrix. And so when extracting we assume that :

  • A cell with a specific cell style (mostly a specific background color defined in an internal setting) is a data cell to extract;
  • A cell containing a formula is formula cell to extract, also with extract its default value as a data cell;
  • All other cells are ignored from extraction;

To do calculation from cell formulas, the API implements Excel sheet behaviors that is often terribly not what is Python maths behaviors. For example, Excel use weak typing (somewhat like Javascript) with string and numbers that the API must support in operators (like add, multiply, div, etc..). Also the API have to support Excel functions. Because the goal is not to totally implement Excel features, only used operators and functions in the different plateform’s workbook are implemented.

API is low level, don’t use Django and so does not have access to other cell datas that the default ones extracted from the Excel files.

To generate a workbook matrix, the API read its Excel file and store the cells datas in a Python way, transcripting Excel cells formulas to Python code that is evaluated when reading them.

Matrix

A workbook matrix is composed of multiple parts :

  • A data matrix which contains all cell values for data cells and formulas cells;
  • A formula matrix which contains all formula cells where formulas have been transcripted to their Python equivalent;
  • A chart matrix which contains all chart maps required to build them from cell datas;
  • A matrix summary which contains some shortcut, labels and coordinates to use some specific cell out of the workbook context;
  • A matrix source which contains all formula cells in the Excel syntax, this is mainly for debugging purpose;

Technically a matrix is mainly a simple Class object containing a dict where is stored sheets and their cells.

Data matrix sample

This is a sample taken from a Workbook for 3years activities, in reality this file contains two thousand of cells.

class DatasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés', .... ]

    def __init__(self):
        self.cells = {
            u'Données générales':{
                'C4': (u'', 'string', 'string'),
                'C5': (0, 'number', 'number'),
                'C6': (u'Association', 'string', 'string'),
                ....
            },
            u'Bilans passés':{
                'B25': (u'', 'longstring', 'string'),
                'C4': (2007, 'number', 'number'),
                'C6': (0, 'number', 'number-grouped'),
                'C8': (0, 'number', 'number-grouped'),
                'C10': (0, 'number', 'number-grouped'),
                ....
            },
            ....
        }

As you can see, the matrix store the sheet names in a separate list and cells are stored in their sheet dictionnary. Cells index is their cell coordinate and each cell contains a tuple of three required items : its value (extracted and unchanged from the Excel file), its type (that is sometimes called kind in the code because type is a reserved word) and its display format.

The extracted value is the value input from human if any, else it’s the default value (0 for numbers and an empty string for everything else).

Formula matrix sample

This is a sample taken from a Workbook for 3years activities, in reality this file contains a thousand of cells.

class FormulasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés', .... ]

    formulas = {
        u'Données générales':{
        },
        u'Bilans passés':{
            'C4': (u"self.Range(u'Bilans passés','D4')-wDecimal('1')", 'fo:number', 'number'),
            'C12': (u"self.Sum(self.Range(u'Bilans passés','C6','C10'))", 'fo:number', 'number-grouped'),
            'C18': (u"self.Sum(self.Range(u'Bilans passés','C13','C17'))", 'fo:number', 'number-grouped'),
            'C20': (u"self.Sum(self.Range(u'Bilans passés','C19'),self.Range(u'Bilans passés','C18'),self.Range(u'Bilans passés','C12'))", 'fo:number', 'number-grouped'),
            'D4': (u"self.Range(u'Bilans passés','E4')-wDecimal('1')", 'fo:number', 'number'),
            ....
        },
        ....
    }

The format is almost the same than the data matrix apart from first tuple contains the formula (in Python, using the manager methods) and the type that is allways prefixed with fo: (to distinct a formula type).

How a matrix is extracted

There are many various tools to assist you to extract and the process is somewhat complicated. To resume, we first inspect the source (a FODS file) to check that the parser can fully read it and there is no problems.

Then we process to extract and generate the matrix.

Finally we apply a scheme of patch to correct some extract flaw, because there is some detail the parser cannot get some cell (often with non data cells that contain some text string needed to be assumed as data cells) or some formula badly typed that cannot work with our parser.

Manager

Like Django models have managers to perform queries on their database entries, a Workbook matrix have a manager to read and write to its cells.

The base matrix manager (badly named WorkbookBaseMatrix..) is instantiated with a matrix to be able to find cells and perform calculations or editing cells values.

It is the manager that implements Excel functions used in formulas transcripted to Python. In effect, the Python code for a formula use manager instance methods to apply Excel functions like self.Sum(...) to perform an Excel sum. Calculations are done through the wDecimal object, see below.

wDecimal

Because we are doing Financial plateform, we need a strong accuracy on calculations and so we need to use the Python Decimal object that can support float number without to do any round effect with any operator, like Python float can do in some case.

But also with Excel weak typing, sometime some formula can try to mix string and numbers in a way that is not supported from Decimal (nor Float and Integer) and so we needed to extend the Decimal object to implement this weak typing with operators.

This is why internally, matrix and manager allways use wDecimal for numbers, an extended Decimal object.

wDecimal support string concatenation from left or the right, allways returns a wDecimal object from any operation and return a simple zero for division by zero without raising an exception.

>>> wDecimal(4)+" alertes"
4 alertes
>>> wDecimal(40)+2
42
>>> wDecimal(40)+wDecimal(2)
42
>>> wDecimal(40)+wDecimal(2)+" alertes"
42 alertes
>>> wDecimal(40)+wDecimal(2)+u" alèrtes"
42 alèrtes
>>> wDecimal(40)+u" alèrtes"+wDecimal(2)
40 alèrtes2
>>> wDecimal(2)-wDecimal(42)+wDecimal(7)
-33
>>> wDecimal(40)*wDecimal(2)
80
>>> wDecimal(40)-wDecimal(2)
38
>>> wDecimal(40)/wDecimal(2)
20
>>> wDecimal(70)/0
0
>>> wDecimal(5)**wDecimal(5)
3125

Finally, how we calculate a formula

Finding a value from a data cell is simple as getting it with its sheetname and cell coordinates from the manager.

But the process to calculate the value from a formula cell is a little more complex. Example with the following sample formula matrix :

class FormulasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés' ]

    formulas = {
        u'Données générales':{
        },
        u'Bilans passés':{
            'C4': (u"self.Range(u'Bilans passés','D4')-wDecimal('1')", 'fo:number', 'number'),
            'G4': (u"self.Range(u'Bilans passés','C4')", 'fo:number', 'number'),
        },

If we try to calculate the ‘G4’ cell, the manager will evaluate its code that is self.Range(u'Bilans passés','C4').

Reading this, the manager use its Range method that will go to find the C4 cell in the same sheet (Bilans passés) first in the data matrix and if not finded, in the formula matrix.

So the C4 is evaluated and it use the Range method to get the D4 cell from the same cell, you can see that this it is not defined in formula matrix, so it is taken from the data matrix.

And finally, wDecimal('1') is subtracted from this value, this will return a wDecimal.

This is a very simple example, because usually formula cell mix many usage of Range and other methods, with multiple cells. Often, calculate a formula will bounce on many cells, it can be painful to debug if you have to.