persia.embedding.data

Module Contents

class persia.embedding.data.IDTypeFeature(name, data)

IDTypeFeature is a sparse matrix in LIL (list of lists) format that holds categorical ID data. Each entry of the list is one sample, and a sample may contain zero or more IDs.

Example for IDTypeFeature:

import numpy as np
from persia.embedding.data import IDTypeFeature

lil_matrix = [
    np.array([1], dtype=np.uint64) for i in range(5)
]
id_type_feature = IDTypeFeature("id_type_feature_with_single_id", lil_matrix)

lil_matrix = [
    np.array([], dtype=np.uint64), # allow empty sample
    np.array([10001], dtype=np.uint64),
    np.array([], dtype=np.uint64),
    np.array([10002], dtype=np.uint64),
    np.array([10010], dtype=np.uint64)
]
id_type_feature = IDTypeFeature("id_type_feature_with_empty_sample", lil_matrix)

Note

IDTypeFeature requires np.uint64 as type for its elements.

Parameters:

name (str) – name of IDTypeFeature.
data (List[np.ndarray]) – sparse matrix data in LIL format, one array per sample. Requires np.uint64 as type for its elements.
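
As a rough illustration of the expected input shape, the sketch below builds a LIL-style list of np.uint64 arrays from ragged Python lists. The helper to_lil_rows is hypothetical and not part of PERSIA; it relies only on NumPy.

```python
import numpy as np


def to_lil_rows(samples):
    """Convert ragged lists of non-negative ints into a list of uint64 arrays.

    Hypothetical helper for illustration; not part of PERSIA.
    """
    rows = []
    for sample in samples:
        # uint64 cannot represent negative values, so reject them early
        if any(value < 0 for value in sample):
            raise ValueError("IDs must be non-negative to fit in uint64")
        rows.append(np.array(sample, dtype=np.uint64))
    return rows


raw_samples = [[10001, 10002], [], [10010]]  # the second sample is empty
lil_rows = to_lil_rows(raw_samples)
```

A list built this way has the shape described for the data parameter, so it could be passed as IDTypeFeature("feature_name", lil_rows).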

class persia.embedding.data.IDTypeFeatureWithSingleID(name, data)

IDTypeFeatureWithSingleID is a special form of IDTypeFeature in which every sample in the batch has exactly one ID. Unlike IDTypeFeature, it runs the type check only once for the whole batch, which can speed up data preprocessing significantly when the batch size is large.

Example for IDTypeFeatureWithSingleID:

import numpy as np
from persia.embedding.data import IDTypeFeatureWithSingleID

batch_with_single_id = np.array(
    [10001, 10002, 10010, 10020, 10030], np.uint64
)
id_type_feature = IDTypeFeatureWithSingleID("id_type_feature_demo", batch_with_single_id)

Note

IDTypeFeatureWithSingleID requires np.uint64 as type for its elements.

Note

IDTypeFeatureWithSingleID does not allow empty samples in the batch data. If a batch may contain empty samples, use IDTypeFeature instead. See IDTypeFeature for more details.

Parameters:

name (str) – name of IDTypeFeatureWithSingleID.
data (np.ndarray) – IDTypeFeatureWithSingleID data. Requires np.uint64 as type for its elements.
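
One way to choose between the two classes is to check whether every sample holds exactly one ID. The sketch below is a minimal, NumPy-only illustration; collapse_single_ids is a hypothetical helper, not a PERSIA function.

```python
import numpy as np


def collapse_single_ids(lil_rows):
    """Return a flat uint64 array if every sample has exactly one ID, else None.

    Hypothetical helper for illustration; not part of PERSIA.
    """
    if len(lil_rows) == 0 or any(len(row) != 1 for row in lil_rows):
        return None
    return np.concatenate(lil_rows).astype(np.uint64, copy=False)


single = [np.array([10001], dtype=np.uint64), np.array([10002], dtype=np.uint64)]
ragged = [np.array([], dtype=np.uint64), np.array([10002], dtype=np.uint64)]

flat = collapse_single_ids(single)      # flat array, one ID per sample
not_flat = collapse_single_ids(ragged)  # None, because one sample is empty
```

When the helper returns an array, that array has the shape IDTypeFeatureWithSingleID expects; when it returns None, the original list is the better fit for IDTypeFeature.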

class persia.embedding.data.Label(data, name=None)

Bases: NdarrayDataBase

Label is a subclass of NdarrayDataBase that holds label data. It accepts various datatypes and multi-dimensional data.

Example for Label:

import numpy as np
from persia.embedding.data import Label

label_data = np.array([35000, 36000, 100000, 5000, 10000], dtype=np.float32)
label = Label(label_data, name="income_label")

label_data = np.array([True, False, True, False, True], dtype=np.bool_)
label = Label(label_data, name="ctr_bool_label")

Alternatively, you can pack several labels into one multi-dimensional array, which avoids memory fragmentation and repeated type checks.

import numpy as np
from persia.embedding.data import Label

label_data = np.array([
        [True, False],
        [False, True],
        [True, True],
        [False, False],
        [False, True]
    ], dtype=np.bool_
)
label = Label(label_data, "click_with_is_adult")

Parameters:

data (np.ndarray) – Numpy array.
name (str, optional) – name of data.

DEFAULT_NAME = label_anonymous

class persia.embedding.data.NdarrayDataBase(data, name=None)

NdarrayDataBase is a data structure that supports various datatypes and multi-dimensional data. PERSIA converts an NdarrayDataBase to a torch.Tensor, so the supported datatypes are the intersection of the NumPy datatypes and the PyTorch datatypes.

The following datatypes are supported for NdarrayDataBase:

datatype
np.bool
np.int8
np.int16
np.int32
np.int64
np.float32
np.float64
np.uint8

Parameters:

data (np.ndarray) – Numpy array.
name (str, optional) – name of data.
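
To catch unsupported arrays before constructing a Label or NonIDTypeFeature, a local dtype check along the following lines may help. This is only a sketch: SUPPORTED_DTYPES mirrors the table above, is_supported_dtype is a hypothetical helper, and np.bool_ is used in place of np.bool because the latter name is missing in some NumPy releases.

```python
import numpy as np

# Mirrors the datatype table above; np.bool_ stands in for np.bool
SUPPORTED_DTYPES = {
    np.dtype(t)
    for t in (
        np.bool_, np.int8, np.int16, np.int32, np.int64,
        np.float32, np.float64, np.uint8,
    )
}


def is_supported_dtype(array):
    """Hypothetical local check; PERSIA performs its own validation."""
    return array.dtype in SUPPORTED_DTYPES


ok = is_supported_dtype(np.zeros((5, 3), dtype=np.float32))
bad_uint64 = is_supported_dtype(np.zeros(5, dtype=np.uint64))
bad_float16 = is_supported_dtype(np.zeros(5, dtype=np.float16))
```

An array with an unsupported dtype such as np.float16 can usually be cast first, for example with array.astype(np.float32).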

DEFAULT_NAME = ndarray_base

class persia.embedding.data.NonIDTypeFeature(data, name=None)

Bases: NdarrayDataBase

NonIDTypeFeature is a subclass of NdarrayDataBase that holds dense (non-ID) feature data. It accepts various datatypes and multi-dimensional data.

Example for NonIDTypeFeature:

import numpy as np
from persia.embedding.data import NonIDTypeFeature

# float32 data
non_id_type_feature_data = np.array([163, 183, 161, 190, 170], dtype=np.float32)
non_id_type_feature = NonIDTypeFeature(non_id_type_feature_data, "height")

# image data
non_id_type_feature_data = np.zeros((5, 3, 32, 32), dtype=np.uint8)
non_id_type_feature = NonIDTypeFeature(non_id_type_feature_data, "image_data")

Parameters:

data (np.ndarray) – Numpy array.
name (str, optional) – name of data.

DEFAULT_NAME = non_id_type_feature_anonymous

class persia.embedding.data.PersiaBatch(id_type_features, non_id_type_features=None, labels=None, batch_size=None, requires_grad=True, meta=None)

PersiaBatch is the dataset type used internally by PERSIA. It wraps IDTypeFeature, NonIDTypeFeature, and Label data together with optional meta bytes.

Example for PersiaBatch:

import time
import json
import numpy as np
from persia.embedding.data import PersiaBatch, NonIDTypeFeature, IDTypeFeature, Label

batch_size = 1024

non_id_type_feature = NonIDTypeFeature(np.zeros((batch_size, 2), dtype=np.float32))

label = Label(np.ones((batch_size, 2), dtype=np.float32))

id_type_feature_num = 3
id_type_feature_max_sample_length = 100
id_type_features = [
    IDTypeFeature(
        f"feature_{idx}",
        [
            # each sample gets a random number of IDs (possibly zero)
            np.ones(
                np.random.randint(id_type_feature_max_sample_length),
                dtype=np.uint64
            )
            for _ in range(batch_size)
        ]
    )
    for idx in range(id_type_feature_num)
]

meta_info = {
    "timestamp": time.time(),
    "weight": 0.9,
}
# meta expects bytes, so encode the JSON string
meta_bytes = json.dumps(meta_info).encode("utf-8")

persia_batch = PersiaBatch(
    id_type_features,
    non_id_type_features=[non_id_type_feature],
    labels=[label],
    requires_grad=True,
    meta=meta_bytes
)

Note

Label data must be provided when requires_grad=True.

Parameters:

id_type_features (List[Union[IDTypeFeatureWithSingleID, IDTypeFeature]]) – categorical data whose datatype should be uint64.
non_id_type_features (List[NonIDTypeFeature], optional) – dense data.
labels (List[Label], optional) – labels data.
batch_size (int, optional) – number of samples in each batch. All IDTypeFeature, NonIDTypeFeature and Label entries should have the same batch_size.
requires_grad (bool, optional) – whether to set requires_grad for id_type_features.
meta (bytes, optional) – binary data.
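
Because meta is plain bytes, the encoding is up to the caller. The sketch below assumes JSON encoded as UTF-8, which is one possible choice and not something PERSIA requires, and shows how the same payload could be decoded again on the consumer side.

```python
import json
import time

# Producer side: build a small dictionary and encode it to bytes
meta_info = {"timestamp": time.time(), "weight": 0.9}
meta_bytes = json.dumps(meta_info).encode("utf-8")

# Consumer side: decode the bytes back into a dictionary
decoded_meta = json.loads(meta_bytes.decode("utf-8"))
```

Any other byte serialization works the same way, as long as the producer and the consumer agree on the format.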

to_bytes()

Serialize the PersiaBatch to bytes after running its validation checks.

Return type: bytes

persia.embedding.data.MAX_BATCH_SIZE = 65535
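
This page does not say how MAX_BATCH_SIZE is enforced. Assuming it is an upper bound on the number of samples in one PersiaBatch, a larger dataset could be split into chunks before batches are built. The helper split_into_batches below is hypothetical and uses only the standard library.

```python
MAX_BATCH_SIZE = 65535  # value documented for persia.embedding.data.MAX_BATCH_SIZE


def split_into_batches(samples, max_batch_size=MAX_BATCH_SIZE):
    """Split a sequence into consecutive chunks of at most max_batch_size items.

    Hypothetical helper; assumes MAX_BATCH_SIZE is a per-batch sample limit.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [
        samples[start:start + max_batch_size]
        for start in range(0, len(samples), max_batch_size)
    ]


chunks = split_into_batches(list(range(150000)))
chunk_sizes = [len(chunk) for chunk in chunks]  # [65535, 65535, 18930]
```

Each chunk can then be turned into its own set of features and labels with a matching batch_size.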