Cleora logo

Cleora is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data. Cleora embeds entities in n-dimensional spherical spaces utilizing extremely fast, stable, iterative random projections, which allows for unparalleled performance and scalability.

Types of data that can be embedded include, for example:

Cleora's input

Cleora is built as a multi-purpose "just embed it" tool, suitable for many different data types and formats. Our tool ingests relational tables with different types of entities, e.g., a table storing user IDs and the URLs of pages they visited. The table can contain categorical columns with:

The Cleora algorithm represents this table as an undirected heterogeneous hypergraph:

Example:

Consider a table representing shopping baskets.

The table contains the following columns:

user <\t> product <\t> store

The input file contains values for each column:

user_id <\t> product_id product_id product_id <\t> store_id
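
For example, a single line in this format might look as follows (hypothetical identifiers, shown for illustration only):

U1 <\t> P1 P2 P3 <\t> S1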

In this example, every column has a type. The types determine whether identifiers in different columns refer to the same entity or to different entities. Two columns may share a type; in that case, the same identifier appearing in either column refers to the same entity. This is the case for homogeneous graphs:

user <\t> user

Based on the column format specification, Cleora performs:

Output: The final output of Cleora consists of multiple files for each (undirected) pair of entity types in the table.

Graph construction

Cleora architecture

For the command used to run Cleora and other configuration options, see the section Cleora configuration.

Graph construction starts with the creation of a helper matrix P as a regular 2-D Rust array, which is built according to the selected expansion method (for details see section Hypergraph expansion methods).
The example in the figure above presents the Cleora algorithm with the clique expansion method. In clique expansion, each node is connected to every other node in the same row; hence, a Cartesian product (all combinations) of all columns is created.
Additionally, each entity identifier from the original input file is hashed with xxhash, a fast and efficient hashing method, which allows us to store identifiers in a unified, compact format.

From the first row of the example input data table:

U1 P1 P2 B1 B2

we get 4 combinations:

[4, U1hash, P1hash, B1hash]
[4, U1hash, P1hash, B2hash]
[4, U1hash, P2hash, B1hash]
[4, U1hash, P2hash, B2hash]

At the beginning of each combination, we insert the total number of combinations for that row (in this case 4). Then we add another 3 combinations from the second row of the input.
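
Below is a minimal, illustrative sketch of this expansion-and-hashing step. It is not Cleora's actual code: the function names are made up, and the twox-hash crate is used here only as one available xxhash implementation. Given the first input row, it prints the four count-prefixed combinations listed above.

```rust
// Illustrative sketch of clique expansion with xxhash-hashed identifiers.
// Not Cleora's actual implementation; names and the hashing crate are assumptions.
use std::hash::Hasher;
use twox_hash::XxHash64;

// Hash an entity identifier into a compact u64 (xxhash via the twox-hash crate).
fn hash_id(id: &str) -> u64 {
    let mut hasher = XxHash64::with_seed(0);
    hasher.write(id.as_bytes());
    hasher.finish()
}

// Build all combinations (the Cartesian product) of one input row's columns.
// Each combination starts with the total number of combinations for the row.
fn clique_expand(columns: &[Vec<&str>]) -> Vec<Vec<u64>> {
    let total: u64 = columns.iter().map(|c| c.len() as u64).product();
    let mut combos = vec![vec![total]];
    for column in columns {
        let mut next = Vec::new();
        for combo in &combos {
            for &id in column {
                let mut extended = combo.clone();
                extended.push(hash_id(id));
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}

fn main() {
    // First input row: U1 | P1 P2 | B1 B2  ->  1 * 2 * 2 = 4 combinations.
    let row = vec![vec!["U1"], vec!["P1", "P2"], vec!["B1", "B2"]];
    for combo in clique_expand(&row) {
        println!("{:?}", combo); // e.g. [4, U1hash, P1hash, B1hash]
    }
}
```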

Each pair of columns in P represents a different type of relation, for example: user bought product, product has a brand, user bought something from a brand. For each such pair of columns we create a separate matrix M as a SparseMatrix (the matrices M will usually hold mostly zeros). Each matrix M is produced in a separate thread in a stepwise fashion: the rows of matrix P are broadcast to all matrix M objects, and each matrix M object reads the buffer, selects the values relevant to its pair of columns, and updates its contents.
For example:
Matrix M3 (users and products) reads the hashes from column 1, which stores user id hashes, and column 2, which stores product id hashes. After reading the first vector:

[4, U1hash, P1hash, B1hash]

the edge value for U1hash <-> P1hash equals ¼ (1 divided by the total number of Cartesian products). After reading the next vector:

[4, U1hash, P1hash, B2hash]

the edge value for U1hash <-> P1hash updates to ½ (¼ + ¼). After reading the next two, we finally have:

U1hash <-> P1hash = ½

U1hash <-> P2hash = ½

For maximum efficiency, we created a custom sparse matrix data structure: the SparseMatrix struct. It follows the sparse matrix coordinate (COO) format, and its purpose is to save space by holding only the coordinates and values of nonzero entries.
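
A much-simplified sketch of such an accumulator is shown below (illustrative only; the struct and method names are assumptions and this is not Cleora's actual SparseMatrix). It keeps only nonzero coordinate/value entries and reproduces the U1/P1 example above.

```rust
// Illustrative coordinate-keyed sparse accumulator (not Cleora's SparseMatrix).
// Only nonzero (row, column) coordinates and their values are stored.
use std::collections::HashMap;

struct SparseMatrixSketch {
    entries: HashMap<(u64, u64), f32>, // (row hash, column hash) -> edge value
}

impl SparseMatrixSketch {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Add 1 / total_combinations to the edge (a, b).
    fn add_pair(&mut self, a: u64, b: u64, total_combinations: u64) {
        *self.entries.entry((a, b)).or_insert(0.0) += 1.0 / total_combinations as f32;
    }
}

fn main() {
    // Stand-ins for the xxhash values of U1, P1 and P2.
    let (u1, p1, p2) = (1u64, 2u64, 3u64);
    let mut m3 = SparseMatrixSketch::new(); // users x products

    // The user/product pairs from the four combinations of the first row (count = 4).
    for (user, product) in [(u1, p1), (u1, p1), (u1, p2), (u1, p2)] {
        m3.add_pair(user, product, 4);
    }

    assert_eq!(m3.entries[&(u1, p1)], 0.5); // U1 <-> P1 = 1/2
    assert_eq!(m3.entries[&(u1, p2)], 0.5); // U1 <-> P2 = 1/2
}
```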

Training embeddings

The constructed graph is used to compute node embeddings representing the entities in the graph (here: users, products, brands). Embedding training proceeds separately for each matrix M: the node embedding matrix T is initialized with random vectors and then iteratively multiplied by M, with L2-normalization applied after each multiplication.

The multiplication of M against each column of matrix T is performed in a separate thread. The resulting columns of the new matrix T are then merged into the full matrix. Similarly, L2-normalization is performed in a multithreaded fashion across the matrix columns.
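
The per-column scheme can be pictured with the following minimal sketch (illustrative only, not Cleora's actual code): M is given as a COO edge list, T is stored column by column, each new column is computed in its own thread, and every node's embedding vector is then L2-normalized so that embeddings lie on the unit sphere, as described at the top of this document. The real multithreaded normalization differs in its details.

```rust
// Illustrative sketch of one training iteration: new_T = normalize(M * T).
// Not Cleora's actual code; the COO edge-list representation and the
// per-row normalization shown here are simplifying assumptions.
use std::thread;

// new_col[i] = sum over edges (i, j, w) of w * old_col[j]
fn multiply_column(edges: &[(usize, usize, f32)], old_col: &[f32], n: usize) -> Vec<f32> {
    let mut new_col = vec![0.0f32; n];
    for &(i, j, w) in edges {
        new_col[i] += w * old_col[j];
    }
    new_col
}

// t holds the embedding matrix T as a list of columns; n is the number of nodes.
fn train_iteration(edges: &[(usize, usize, f32)], t: &[Vec<f32>], n: usize) -> Vec<Vec<f32>> {
    // Multiply M against every column of T, one thread per column.
    let mut new_t: Vec<Vec<f32>> = thread::scope(|s| {
        let handles: Vec<_> = t
            .iter()
            .map(|col| s.spawn(move || multiply_column(edges, col, n)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // L2-normalize each node's embedding vector (one row across all columns).
    for i in 0..n {
        let norm = new_t.iter().map(|col| col[i] * col[i]).sum::<f32>().sqrt();
        if norm > 0.0 {
            for col in new_t.iter_mut() {
                col[i] /= norm;
            }
        }
    }
    new_t
}

fn main() {
    // Toy graph: 3 nodes, M in COO form, T with 2 embedding dimensions (2 columns).
    let edges = [(0usize, 1usize, 0.5f32), (1, 0, 0.5), (1, 2, 0.5), (2, 1, 0.5)];
    let t = vec![vec![1.0, -1.0, 1.0], vec![-1.0, 1.0, 1.0]];
    let new_t = train_iteration(&edges, &t, 3);
    println!("{:?}", new_t);
}
```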

Memory consumption

Every SparseMatrix object allocates space for:

During training, we additionally need

\[ 2 \times d \times |V| \]

objects, each occupying 4 bytes (this can be avoided by using memory-mapped files, see --in-memory-embedding-calculation argument for the program).
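
For a rough sense of scale (illustrative numbers only): with an embedding dimension d = 128 and |V| = 10 million nodes, this amounts to

\[ 2 \times 128 \times 10^{7} \times 4\,\text{B} \approx 10\,\text{GB} \]

of additional memory during training.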