API

The API is the name of the part that is “webserver agnostic”, this means it’s only base stuff used by more high level code.

The only thing you can directly use here is some commande line scripts because the API is only about working with Workbook Excel files and their matrix, nothing else.

Separating this from the Django apps was needed because we shouldn’t have to load all Django layers to be able to extract Matrix from Excel files or to validate them.

Weak Decimal

Location

api.weak_decimal.wDecimal

Goal

  1. Manage numbers with perfect accuracy for floating numbers, because we are on a financial plateform where accuracy does matter;
  2. Support weak type operations between numbers and strings (like Excel does). Concretely to concatenate a string to a number;
  3. Permit divisions with zero (like Excel does). Concretely a number divided by zero or a zero divided by another number will allways return zero without raising exceptions;

Scope

Internally, Autodiag allways use weak decimal to compute datas.

This means getted cell datas numbers from database are transformed to weak decimal and formulas allways return weak decimal for numbers.

wDecimal inherit from python object decimal.Decimal then overrides some methods to implement some Excel behaviors.

Not all operators implemented in decimal.Decimal are really implemented into wDecimal, only +,``-,``*,``/,``**. Others may work, but they will return a decimal.Decimal type and not a wDecimal type, that may break chaining operations if used.

Demonstration

>>> wDecimal(4)+" alertes"
4 alertes
>>> wDecimal(40)+2
42
>>> wDecimal(40)+wDecimal(2)
42
>>> wDecimal(40)+wDecimal(2)+" alertes"
42 alertes
>>> wDecimal(40)+wDecimal(2)+u" alèrtes"
42 alèrtes
>>> wDecimal(40)+u" alèrtes"+wDecimal(2)
40 alèrtes2
>>> wDecimal(2)-wDecimal(42)+wDecimal(7)
-33
>>> wDecimal(40)*wDecimal(2)
80
>>> wDecimal(40)-wDecimal(2)
38
>>> wDecimal(40)/wDecimal(2)
20
>>> wDecimal(70)/0
0
>>> wDecimal(5)**wDecimal(5)
3125

Workbook origin

Before Autodiag, France Active was managing workbooks from Excel files that was given to the structures to fill their financial datas.

Workbook’s Excel files have been developped with multiple sheets for some kind of financial activities and balance sheets. Each sheet contains many data and formula cells. The Excel files are protected and so structures cannot modify them, only fill data in allowed cells.

There are multiple workbook variants and each of them has its own dedicated Excel file.

Each Autodiag plateform have its own workbooks, however they are similar on many things.

Workbook formats

Even we are talking about Excel files (XSL file), internally this is not really true. When the project has been started, there was no Python library that can fully acceed to all needed datas from a common Excel file (like formulas and cell style), so it has been decided to use OpenOffice Flat XML format that was an XML format that can be fully parsed with any XML parser.

The method is somewhat tricky because it involved to open the original Excel file with OpenOffice then save it as a Flat XML file. Problem is since OpenOffice 3.2 has changed this format and its successor LibreOffice has inherited from it, so the parser from the API is not up to date with it and that leads to the fact that we can’t actually update Workbook matrix from the sources. The parser need to be updated and it is too much work for now.

Each Workbook sources are actually stored in three formats :

  • XLS: Common Excel file, that is the source used to create Flat XML version and used as a template to export Structure’s Workbooks from the web interface. Provided by France Active;
  • XLSX: OpenDocument (XML) Excel file, used with xml2xsl, a java tool to generate XLS file from workbook datas. Provided by France Active;
  • FODS: Flat OpenOffice XML file, used to inspect and extract a workbook to their Python matrix. Generated from the common Excel files;

Note

This is the actual definition for used formats. It is planned to only use XSLX files (with openpyxl) and remove XSL and FODS formats.

Workbook extraction

Process

There are some steps to publish new Python workbook matrix:

  1. Inspect the source (a FODS file) to check that the parser can fully read it and there is no basic problems;
  2. Then we process to extract and generate the Python matrix;
  3. Finally we apply a modifier scheme to fix some extract flaws, like non data cells that contain some text string (that are not eligible to be data cells) or some formulas badly typed that cannot work with our parser;

Tools

There are multiple tools to proceed to extraction:

  1. Extractor to read and parse Excel workbook;
  2. Inspector that inherits from Extractor and check extracted content to validate some basic details;
  3. Publisher that inherits from Extractor and use extracted content to build Python matrix;
  4. Modifier use a modifier scheme to apply some changes to Python matrix to fix some flaws from Extractor parser;

Workbook matrix

A Python matrix contains cell coordinates, their kinds, their default datas and eventually their formulas. This is used to perform calculation between cells because like within an Excel file, cells formulas can adress to other cells from other sheets and follow many cells through many sheets to do their calculations.

We assume two virtual kinds of cell to extract: data cell and formula cell. We call them virtual because in Excel a cell is a formula cell and also a data cell.

API implements also this behavior but we need to distinct these kinds to build separated matrix. And so when extracting we assume that :

  • A cell with a specific cell style (mostly a specific background color defined in an internal setting) is a data cell to extract;
  • A cell containing a formula is a formula cell to extract, also extracting its default value as a data cell;
  • All other cells are ignored from extraction;

API is low level and unaware of Django and database and so does not have access to content from database.

To generate a workbook matrix, the API read its Excel file and store the cells datas in a Python way, transcripting Excel cells formulas to Python code that is evaluated when reading them.

Matrix

A workbook matrix is composed of multiple parts :

  • A data matrix which contains all cell values for data cells and formulas cells;
  • A formula matrix which contains all formula cells where formulas have been transcripted to their Python equivalent;
  • A chart matrix which contains all chart maps required to build them from cell datas;
  • A matrix summary which contains some shortcut, labels and coordinates to use some specific cell out of the workbook context;
  • A matrix source which contains all formula cells in the Excel syntax, this is mainly for debugging purpose;

Technically a matrix is mainly a simple Class object containing a dict where is stored sheets and their cells.

To do calculation from cell formulas, the API implements Excel sheet behaviors:

  • Excel use weak typing, the API supports it with Weak Decimal;
  • API have to support Excel functions. But only used operators and functions in Autodiag plateforms workbooks are implemented;

Data matrix sample

This is a sample taken from a Workbook for 3years activities workbook (for true this file contains two thousand of cells).

class DatasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés', .... ]

    def __init__(self):
        self.cells = {
            u'Données générales':{
                'C4': (u'', 'string', 'string'),
                'C5': (0, 'number', 'number'),
                'C6': (u'Association', 'string', 'string'),
                ....
            },
            u'Bilans passés':{
                'B25': (u'', 'longstring', 'string'),
                'C4': (2007, 'number', 'number'),
                'C6': (0, 'number', 'number-grouped'),
                'C8': (0, 'number', 'number-grouped'),
                'C10': (0, 'number', 'number-grouped'),
                ....
            },
            ....
        }

As you can see, the matrix store the sheet names in a separate list and cells are stored in their sheet dictionnary. Cells index is their cell coordinate and each cell contains a tuple of three required items:

  • its value (extracted and unchanged from the Excel file);
  • its type name (that is often called kind in code because type is a Python reserved word);
  • its display format name;

Extracted value is the value inputed from human if any, else it’s the default value (0 for numbers and an empty string for everything else).

Formula matrix sample

This is a sample taken from a Workbook for 3years activities workbook (for true this file contains a thousand of cells).

class FormulasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés', .... ]

    formulas = {
        u'Données générales':{
        },
        u'Bilans passés':{
            'C4': (u"self.Range(u'Bilans passés','D4')-wDecimal('1')", 'fo:number', 'number'),
            'C12': (u"self.Sum(self.Range(u'Bilans passés','C6','C10'))", 'fo:number', 'number-grouped'),
            'C18': (u"self.Sum(self.Range(u'Bilans passés','C13','C17'))", 'fo:number', 'number-grouped'),
            'C20': (u"self.Sum(self.Range(u'Bilans passés','C19'),self.Range(u'Bilans passés','C18'),self.Range(u'Bilans passés','C12'))", 'fo:number', 'number-grouped'),
            'D4': (u"self.Range(u'Bilans passés','E4')-wDecimal('1')", 'fo:number', 'number'),
            ....
        },
        ....
    }

The format is almost the same as data matrix excepts value is replaced by formula (converted to Python using the Matrix manager methods) and the type that is allways prefixed with fo: (to distinct a formula type for a data type).

Matrix Manager

Like Django models have managers to perform queries on their database entries, a Workbook matrix have a manager to read and write to its cells.

The base matrix manager from API (WorkbookBaseMatrix) is initialized with a Python workbook matrix to be able to find cells and perform calculations or editing cells values.

It implements Excel functions used in formulas converted to Python. Concretely the Python code for a formula uses manager instance methods to apply Excel functions like self.Sum(...) to perform an Excel sum.

Finally, how we calculate a formula

Finding a value from a data cell is simple as getting it with its sheetname and cell coordinates from the manager.

But the process to calculate the value from a formula cell is a little more complex. Example with the following sample formula matrix :

class FormulasMatrix(object):
    sheet_order = [ u'Données générales', u'Bilans passés' ]

    formulas = {
        u'Données générales':{
        },
        u'Bilans passés':{
            'C4': (u"self.Range(u'Bilans passés','D4')-wDecimal('1')", 'fo:number', 'number'),
            'G4': (u"self.Range(u'Bilans passés','C4')", 'fo:number', 'number'),
        },

If we try to calculate the ‘G4’ cell, the manager will evaluate its code that is self.Range(u'Bilans passés','C4').

Reading this the manager use its Range method that will go to find the C4 cell in the same sheet (Bilans passés), firstly in the data matrix else if not finded, in the formula matrix.

So the C4 is evaluated and it uses the Range method to get the D4 cell from the same cell, you can see that this it is not defined in formula matrix, so it is taken from the data matrix.

And finally, wDecimal('1') is subtracted from this value, this will return a wDecimal.

This is a very simple example, because usually a formula cell mix many usages of methods with multiple cells. Often, calculating a formula will bounce on many cells.