persia.embedding.data
Module Contents
- class persia.embedding.data.IDTypeFeature(name, data)
IDTypeFeature is a sparse matrix in LIL format that contains categorical ID data.
Example for IDTypeFeature:

    import numpy as np
    from persia.embedding.data import IDTypeFeature

    lil_matrix = [np.array([1], dtype=np.uint64) for i in range(5)]
    id_type_feature = IDTypeFeature("id_type_feature_with_single_id", lil_matrix)

    lil_matrix = [
        np.array([], dtype=np.uint64),  # allow empty sample
        np.array([10001], dtype=np.uint64),
        np.array([], dtype=np.uint64),
        np.array([10002], dtype=np.uint64),
        np.array([10010], dtype=np.uint64),
    ]
    id_type_feature = IDTypeFeature("id_type_feature_with_empty_sample", lil_matrix)

Note
IDTypeFeature requires np.uint64 as the type for its elements.
- Parameters:
name (str) – name of the IDTypeFeature.
data (List[np.ndarray]) – sparse matrix data in LIL format. Requires np.uint64 as the type for its elements.
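The LIL layout described above is one array of IDs per sample. As a minimal sketch, assuming the raw IDs arrive as plain Python lists, a helper such as the hypothetical `to_lil_matrix` below (not part of PERSIA) builds data with the shape and element type that IDTypeFeature asks for:

```python
import numpy as np


def to_lil_matrix(rows):
    """Convert per-sample lists of IDs into a list of np.uint64 arrays.

    Hypothetical helper for illustration; it is not part of PERSIA.
    """
    # Passing dtype explicitly keeps empty samples as uint64 too
    # (np.array([]) on its own would default to float64).
    return [np.array(row, dtype=np.uint64) for row in rows]


rows = [[], [10001], [], [10002, 10003], [10010]]
lil_matrix = to_lil_matrix(rows)
sample_lengths = [len(sample) for sample in lil_matrix]
```

The resulting list can then be passed as the data argument, for example IDTypeFeature("some_feature", lil_matrix).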
- class persia.embedding.data.IDTypeFeatureWithSingleID(name, data)
IDTypeFeatureWithSingleID is a special form of IDTypeFeature in which each sample in the batch has exactly one ID. Unlike IDTypeFeature, IDTypeFeatureWithSingleID runs the type check only once, which can speed up data preprocessing significantly for large batch sizes.
Example for IDTypeFeatureWithSingleID:

    import numpy as np
    from persia.embedding.data import IDTypeFeatureWithSingleID

    batch_with_single_id = np.array(
        [10001, 10002, 10010, 10020, 10030], np.uint64
    )
    id_type_feature = IDTypeFeatureWithSingleID("id_type_feature_demo", batch_with_single_id)

Note
IDTypeFeatureWithSingleID requires np.uint64 as the type for its elements.
Note
IDTypeFeatureWithSingleID does not allow empty samples in the batch data. Use IDTypeFeature instead in that case. See IDTypeFeature for more details.
- Parameters:
name (str) – name of the IDTypeFeatureWithSingleID.
data (np.ndarray) – IDTypeFeatureWithSingleID data. Requires np.uint64 as the type for its elements.
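Choosing between the two ID formats comes down to whether every sample holds exactly one ID. The sketch below shows one way to make that decision before constructing a feature; `pack_ids` is a hypothetical helper, not part of PERSIA, and it assumes the input is a list of Python lists.

```python
import numpy as np


def pack_ids(rows):
    """Pick a layout for per-sample ID lists.

    Returns ("single", 1-D uint64 array) when every sample has exactly one ID,
    otherwise ("lil", list of uint64 arrays).
    Hypothetical helper for illustration; it is not part of PERSIA.
    """
    if all(len(row) == 1 for row in rows):
        # One ID per sample: a flat uint64 array is enough.
        return "single", np.array([row[0] for row in rows], dtype=np.uint64)
    # Empty or multi-ID samples need the general LIL layout.
    return "lil", [np.array(row, dtype=np.uint64) for row in rows]


kind_a, data_a = pack_ids([[10001], [10002], [10010]])
kind_b, data_b = pack_ids([[10001], [], [10010, 10020]])
```

Data tagged "single" would suit IDTypeFeatureWithSingleID, while data tagged "lil" would go to IDTypeFeature.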
- class persia.embedding.data.Label(data, name=None)
Bases: NdarrayDataBase
Label is a subclass of NdarrayDataBase that accepts various datatypes and multi-dimensional data.
Example for Label:

    import numpy as np
    from persia.embedding.data import Label

    label_data = np.array([35000, 36000, 100000, 5000, 10000], dtype=np.float32)
    label = Label(label_data, name="income_label")

    # np.bool_ is used here because the np.bool alias is missing in NumPy 1.24 to 1.26.
    label_data = np.array([True, False, True, False, True], dtype=np.bool_)
    label = Label(label_data, name="ctr_bool_label")

You can also add multi-dimensional data to avoid memory fragmentation and repeated type checks.

    import numpy as np
    from persia.embedding.data import Label

    label_data = np.array(
        [
            [True, False],
            [False, True],
            [True, True],
            [False, False],
            [False, True],
        ],
        dtype=np.bool_,
    )
    label = Label(label_data, "click_with_is_adult")

- Parameters:
data (np.ndarray) – Numpy array.
name (str, optional) – name of data.
- DEFAULT_NAME = label_anonymous
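To illustrate the multi-dimensional form, the sketch below stacks two per-sample boolean columns into a single array so that one Label can carry both. The column names and values are made up for illustration.

```python
import numpy as np

# Two separate per-sample label columns (illustrative values).
click = np.array([True, False, True, False, True], dtype=np.bool_)
is_adult = np.array([False, True, True, False, True], dtype=np.bool_)

# Stack the columns side by side into one (batch_size, 2) array.
label_data = np.stack([click, is_adult], axis=1)
```

The stacked array could then be wrapped once, for example Label(label_data, name="click_with_is_adult").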
- class persia.embedding.data.NdarrayDataBase(data, name=None)
NdarrayDataBase is a data structure that supports various datatypes and multi-dimensional data. PERSIA needs to convert the NdarrayDataBase to a torch.Tensor, so the datatypes it supports are the intersection of the NumPy datatypes and the PyTorch datatypes.
The following datatypes are supported for NdarrayDataBase:
datatype
- np.bool
- np.int8
- np.int16
- np.int32
- np.int64
- np.float32
- np.float64
- np.uint8
- Parameters:
data (np.ndarray) – Numpy array.
name (str, optional) – name of data.
- DEFAULT_NAME = ndarray_base
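The datatype list above can be turned into a quick pre-check before data is wrapped. This is a minimal sketch: `is_supported` is a hypothetical helper and is not PERSIA's own validation. It spells the boolean type as np.bool_, which names the same dtype as np.bool.

```python
import numpy as np

# Datatypes listed above as supported by NdarrayDataBase.
SUPPORTED_DTYPES = frozenset(
    np.dtype(t)
    for t in (
        np.bool_, np.int8, np.int16, np.int32, np.int64,
        np.float32, np.float64, np.uint8,
    )
)


def is_supported(array):
    """Return True if the array's dtype is in the documented list.

    Hypothetical helper for illustration; it is not part of PERSIA.
    """
    return array.dtype in SUPPORTED_DTYPES


ok = is_supported(np.zeros((5, 2), dtype=np.float32))
not_ok = is_supported(np.zeros(5, dtype=np.float16))
```

An array with a dtype outside the list, such as np.float16, can be cast with astype to one of the listed types before use.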
- class persia.embedding.data.NonIDTypeFeature(data, name=None)
Bases: NdarrayDataBase
NonIDTypeFeature is a subclass of NdarrayDataBase that accepts various datatypes and multi-dimensional data.
Example for NonIDTypeFeature:

    import numpy as np
    from persia.embedding.data import NonIDTypeFeature

    # float32 data
    non_id_type_feature_data = np.array([163, 183, 161, 190, 170], dtype=np.float32)
    non_id_type_feature = NonIDTypeFeature(non_id_type_feature_data, "height")

    # image data
    non_id_type_feature_data = np.zeros((5, 3, 32, 32), dtype=np.uint8)
    non_id_type_feature = NonIDTypeFeature(non_id_type_feature_data, "image_data")

- Parameters:
data (np.ndarray) – Numpy array.
name (str, optional) – name of data.
- DEFAULT_NAME = non_id_type_feature_anonymous
- class persia.embedding.data.PersiaBatch(id_type_features, non_id_type_features=None, labels=None, batch_size=None, requires_grad=True, meta=None)
PersiaBatch is the type of dataset used internally in PERSIA. It wraps the IDTypeFeature, NonIDTypeFeature, Label and meta bytes data.
Example for PersiaBatch:

    import time
    import json

    import numpy as np
    from persia.embedding.data import PersiaBatch, NonIDTypeFeature, IDTypeFeature, Label

    batch_size = 1024
    non_id_type_feature = NonIDTypeFeature(np.zeros((batch_size, 2), dtype=np.float32))
    label = Label(np.ones((batch_size, 2), dtype=np.float32))

    id_type_feature_num = 3
    id_type_feature_max_sample_length = 100
    id_type_features = [
        IDTypeFeature(
            f"feature_{idx}",
            [
                # each sample gets a random number of IDs (possibly zero)
                np.ones(
                    (np.random.randint(id_type_feature_max_sample_length)),
                    dtype=np.uint64,
                )
                for _ in range(batch_size)
            ],
        )
        for idx in range(id_type_feature_num)
    ]

    meta_info = {
        "timestamp": time.time(),
        "weight": 0.9,
    }
    # meta expects bytes, so encode the JSON string
    meta_bytes = json.dumps(meta_info).encode("utf-8")

    persia_batch = PersiaBatch(
        id_type_features,
        non_id_type_features=[non_id_type_feature],
        labels=[label],
        requires_grad=True,
        meta=meta_bytes,
    )

Note
Label data must be provided if requires_grad=True is set.
- Parameters:
id_type_features (List[Union[IDTypeFeatureWithSingleID, IDTypeFeature]]) – categorical data whose datatype should be uint64.
non_id_type_features (List[NonIDTypeFeature], optional) – dense data.
labels (List[Label], optional) – label data.
batch_size (int, optional) – number of samples in each batch. IDTypeFeature, NonIDTypeFeature and Label should have the same batch_size.
requires_grad (bool, optional) – set requires_grad for id_type_features.
meta (bytes, optional) – binary data.
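Because meta is typed as bytes, structured metadata has to be encoded before it is passed in. A minimal sketch, assuming JSON with UTF-8 encoding as the format (the documentation only describes meta as binary data, and both helper names are hypothetical):

```python
import json


def encode_meta(meta_info):
    """Serialize a JSON-compatible dict to UTF-8 bytes (hypothetical helper)."""
    return json.dumps(meta_info).encode("utf-8")


def decode_meta(meta_bytes):
    """Inverse of encode_meta: turn the bytes back into a dict."""
    return json.loads(meta_bytes.decode("utf-8"))


meta_bytes = encode_meta({"timestamp": 1700000000.0, "weight": 0.9})
```

Any other byte-oriented encoding would work equally well, as long as the consumer of the batch decodes it the same way.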
- to_bytes()
Serialize persia_batch to bytes after checking.
- Return type:
bytes
- persia.embedding.data.MAX_BATCH_SIZE = 65535
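The requirement that all features and labels share one batch size, together with the MAX_BATCH_SIZE constant, can be checked before a batch is assembled. The sketch below is a hypothetical pre-check, not PERSIA's own validation. It assumes that len() of each piece of data is its number of samples, and it treats MAX_BATCH_SIZE as an inclusive upper bound, which the documentation does not state.

```python
MAX_BATCH_SIZE = 65535  # value documented for persia.embedding.data.MAX_BATCH_SIZE


def common_batch_size(*groups):
    """Return the batch size shared by all inputs, or raise ValueError.

    Each group is a list of per-feature data whose len() is the sample count.
    Hypothetical helper for illustration; it is not part of PERSIA.
    """
    sizes = {len(data) for group in groups for data in group}
    if len(sizes) != 1:
        raise ValueError(f"inconsistent batch sizes: {sorted(sizes)}")
    (batch_size,) = sizes
    # Assumption: the limit is inclusive.
    if batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch size {batch_size} exceeds {MAX_BATCH_SIZE}")
    return batch_size


id_data = [[[1], [], [2, 3]]]      # one ID feature, three samples
dense_data = [[[0.0, 1.0]] * 3]    # one dense feature, three samples
label_data = [[1, 0, 1]]           # one label, three samples
batch_size = common_batch_size(id_data, dense_data, label_data)
```

Plain lists stand in for the NumPy arrays here so that the check stays independent of any particular data source.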