toolbox.dao.Feed
class, Feed provides a simple but organized structure for storing data
Args:
multi_group (bool, optional): enables a hierarchical structure. Defaults to False.
**kwargs: keyword-only arguments
dict_DF: attrDict assigned to the DF attribute
dict_PAR: attrDict assigned to the PAR attribute
dict_MAP: attrDict assigned to the MAP attribute
Basic Feed
The default mode for Feed (when multi_group=False). It has three attributes:
- DF: dictionary whose values are all DataFrames
- PAR: dictionary whose values are all constants (int, float, string)
- MAP: dictionary whose values are all hash mappings (dicts)
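A minimal construction sketch (a plain dict is shown where an attrDict is expected; whether plain dicts are accepted here is an assumption):
from toolbox.dao import Feed
feed = Feed(dict_PAR={"horizon": 8})  # hypothetical parameter dict
print(feed.PAR.horizon)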
toolbox.dao.Feed.DF.add
method, add an item to DF
Args:
var (DataFrame): DataFrame to add to DF
property_name (str): property name
toolbox.dao.Feed.DF.delete
method, delete an item from DF
Args:
property_name (str): property in DF
toolbox.dao.Feed.DF.allitems
property, list of all items in DF
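A short usage sketch of add, delete, and allitems (the DataFrame contents and the property name "demand" are illustrative):
import pandas as pd
from toolbox.dao import Feed

feed = Feed()
df = pd.DataFrame({"week": [202301, 202302], "qty": [10, 12]})
feed.DF.add(df, property_name="demand")  # register the DataFrame under DF
print(feed.DF.allitems)  # the list now includes "demand"
feed.DF.delete("demand")  # remove it again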
toolbox.dao.Feed.DF.df_dtype_refine
method, refine all Pandas DataFrames by assigning new data types that save memory. Based on the contents, numerical columns are downcast and string columns are assigned the PyArrow string type.
Args:
inplace (bool, optional): whether to modify the original data. Defaults to False.
ignored_dfs (list, optional): list of keys that will be ignored by this method.
Returns:
object: a copy of the instance with refined data types
toolbox.dao.Feed.DF.key_standardization
method, standardize the dictionary keys using standardize_func
Args:
standardize_func (callable, optional): string conversion function. Defaults to toolbox.string.convert_functions.upper_and_replace_space_with_underscore.
toolbox.dao.Feed.DF.column_standardization
method, standardize the column names of each DataFrame under DF using standardize_func
Args:
standardize_func (callable, optional): string conversion function. Defaults to upper_and_replace_space_with_underscore.
inplace (bool, optional): whether to modify the DataFrame rather than creating a new one. Defaults to False.
Returns:
DataFrame: will differ from the original DF property if `inplace=False`
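A sketch of both standardization helpers; the before/after names are illustrative, and the exact conversion is inferred from the default function's name:
feed.DF.key_standardization()  # e.g., "demand plan" -> "DEMAND_PLAN"
feed.DF.column_standardization(inplace=True)  # rename the columns of every DataFrame in place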
toolbox.dao.Feed.DF.size_summary
property, the size of each item; data storage units are adjusted automatically
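For instance (output illustrative; see the Reduce memory usage section below for real figures):
print(feed.DF.size_summary)  # e.g., {"demand": "1.20 KB"}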
toolbox.dao.Feed.PAR.add
method, add an item to PAR
Args:
var (any): variable to add to PAR
property_name (str): property name
toolbox.dao.Feed.PAR.delete
method, delete an item from PAR
Args:
property_name (str): property in PAR
toolbox.dao.Feed.PAR.allitems
property, list of all items in PAR
toolbox.dao.Feed.PAR.export_df
method, return a DataFrame gathering all parameter:value pairs in PAR; two output formats are provided
Args:
format (_export_mode, optional):
"stacked" or "multi-column".
Defaults to "multi-column".
Returns:
DataFrame
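A sketch of the two formats (the exact column layout of each format is an assumption):
wide = feed.PAR.export_df()  # "multi-column": one column per parameter
long = feed.PAR.export_df(format="stacked")  # "stacked": parameter/value rows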
toolbox.dao.Feed.PAR.key_standardization
method, standardize the dictionary keys using standardize_func
Args:
standardize_func (callable, optional): string conversion function. Defaults to toolbox.string.convert_functions.upper_and_replace_space_with_underscore.
toolbox.dao.Feed.PAR.df_dtype_refine
method, refine all Pandas DataFrames by assigning new data types that save memory. Based on the contents, numerical columns are downcast and string columns are assigned the PyArrow string type.
Args:
inplace (bool, optional): whether to modify the original data. Defaults to False.
ignored_dfs (list, optional): list of keys that will be ignored by this method.
Returns:
object: a copy of the instance with refined data types
toolbox.dao.Feed.PAR.size_summary
property, the size of each item; data storage units are adjusted automatically
toolbox.dao.Feed.MAP.add
method, add an item to MAP
Args:
var (dict): hash mapping to add to MAP
property_name (str): property name
toolbox.dao.Feed.MAP.delete
method, delete an item from MAP
Args:
property_name (str): property in MAP
toolbox.dao.Feed.MAP.allitems
property, list of all items in MAP
toolbox.dao.Feed.MAP.key_standardization
method, standardize the dictionary keys using standardize_func
Args:
standardize_func (callable, optional): string conversion function. Defaults to toolbox.string.convert_functions.upper_and_replace_space_with_underscore.
toolbox.dao.Feed.MAP.df_dtype_refine
method, refine all Pandas DataFrames by assigning new data types that save memory. Based on the contents, numerical columns are downcast and string columns are assigned the PyArrow string type.
Args:
inplace (bool, optional): whether to modify the original data. Defaults to False.
ignored_dfs (list, optional): list of keys that will be ignored by this method.
Returns:
object: a copy of the instance with refined data types
toolbox.dao.Feed.MAP.size_summary
property, the size of each item; data storage units are adjusted automatically
Example of basic Feed
Insert data from scratch
from toolbox.dao import Feed

feed = Feed()
feed.PAR.add("liam", property_name="user_name")
print(feed.PAR.user_name)
feed.MAP.add(
    {"liam": "boy", "milly": "girl"},
    "sexual"
)
print(feed.MAP.sexual["milly"])
liam
girl
Initialize a Feed with a Loader
Insert data by passing a toolbox.dao.files.Loader or a toolbox.dao.attrDict.
from toolbox.dao import Feed
from toolbox.dao.files import FileLoader
path = "projects/testing/test2"
FL = FileLoader(path, archive=False)
FL.execute()
model_input = Feed(dict_DF=FL)
print(model_input)
data attributes:
{
"DF": [
"projection_2427",
"eng_hours",
"workstation_2427",
"flowAdjust_2427_HAZ",
"conversion_2427_HAZ",
"EMSAvailableTools_2427",
]
},
{
"PAR": []
},
{
"MAP": []
}
Hierarchical Feed
When initializing an instance of Feed, set multi_group=True. This mode enables self-named properties on Feed, creating a hierarchical structure that better organizes data.
toolbox.dao.Feed.add_new_group
method, add a new group property to the Feed instance
Args:
group_name (str): group name
**kwargs: keyword-only arguments
dict_DF: attrDict assigned to the DF attribute of the new group
dict_PAR: attrDict assigned to the PAR attribute of the new group
dict_MAP: attrDict assigned to the MAP attribute of the new group
Once this method has been executed, a new property is added to the Feed instance. Each new property behaves like a basic Feed, with the three attributes DF, PAR, and MAP to organize your data.
Example of Hierarchical Feed
import pandas as pd
from toolbox.dao import Feed

mf = Feed(multi_group=True)
mf.add_new_group("school")
student_info = pd.DataFrame({
    "name": ["liam", "krishna", "asha"],
    "department": ["IEOR", "IEOR", "STAT"]
})
mf.school.DF.add(student_info, "student")
print(mf.school.DF.student)
|   | name    | department |
|---|---------|------------|
| 0 | liam    | IEOR       |
| 1 | krishna | IEOR       |
| 2 | asha    | STAT       |
Initialize a Feed with a Loader
Here is an example of how to use it together with toolbox.dao.files.Loader
from toolbox.dao import Feed
from toolbox.dao.files import FileLoader
path = "projects/testing/test2"
FL = FileLoader(path, archive=False)
FL.execute()
model_input = Feed(multi_group=True)
model_input.add_new_group("group_A", dict_DF=FL)
print(model_input)
data attributes:
{
"group_A": [
{
"DF": [
"projection_2427",
"eng_hours",
"workstation_2427",
"flowAdjust_2427_HAZ",
"conversion_2427_HAZ",
"EMSAvailableTools_2427",
]
},
{
"PAR": []
},
{
"MAP": []
}
]
}
Reduce memory usage
Apply the method df_dtype_refine to reduce memory usage by assigning more compact data types. This method also works for toolbox.dao.files.FileLoader. The property size_summary helps check the resulting sizes.
from IPython.display import HTML, display
from toolbox.dao.files import FileLoader

FL = FileLoader("examples/file_loader_testing/2427.tar.gz", True)
FL.execute()
display(HTML("original summary:\n"), FL.size_summary)
display(HTML("revised summary:\n"), FL.df_dtype_refine().size_summary)
original summary:
{'projection_2427': '1.93 MB',
'eng_hours': '63.74 KB',
'workstation_2427': '781.95 KB',
'flowAdjust_2427_HAZ': '30.00 MB',
'conversion_2427_HAZ': '571.81 KB',
'EMSAvailableTools_2427': '5.22 MB',
'availableSummary-Readonly': '35.17 MB'}
revised summary:
{'projection_2427': '774.86 KB',
'eng_hours': '29.17 KB',
'workstation_2427': '291.27 KB',
'flowAdjust_2427_HAZ': '10.78 MB',
'conversion_2427_HAZ': '222.98 KB',
'EMSAvailableTools_2427': '1.93 MB',
'availableSummary-Readonly': '15.81 MB'}
FL.df_dtype_refine() only returns a copy of FL with the revised data types; either assign the result to a variable, such as
new_FL = FL.df_dtype_refine()
or use the inplace argument, just as with most Pandas methods, to apply the change directly to the original data:
FL.df_dtype_refine(inplace=True)
toolbox.dao.DataMigrator
class, DataMigrator migrates data from a database to Azure Blob Storage
Args:
blob_access (toolbox.dao.connector.db_access): db_access for target blob storage
db_access (toolbox.dao.connector.db_access): db_access for source database
toolbox.dao.DataMigrator.migrate
method, migrate data from source db to target Azure Blob Storage
Args:
period (tuple(int,)): (start_week,end_week), e.g., (202201,202208)
target_folder_path (str): path of the target folder
predefined_query_name (str): name of predefined query for source db
cache_name (str): label for the cache file(s) without timestamp
queires_dir (str, optional): directory in which to look for the predefined query. Defaults to "./queries".
chunk_mode (bool, optional): whether to pull data in chunk mode. Defaults to False.
enforce_dtype (bool, optional): whether to enforce data type conversion on the resulting DataFrame. Defaults to True.
Example:
from toolbox.dao import DataMigrator
from toolbox.dao.connector import BlobConnector, DBConnector, parse_db_access
from toolbox.utility import set_logger
logger = set_logger('DEBUG')
blob_access = parse_db_access("db.ini","BLOB_Storage")
BC = BlobConnector(blob_access)
db_access = parse_db_access("db.ini","XEUS")
DBC = DBConnector(db_access)
DM = DataMigrator(
blob_access=blob_access,
db_access=db_access
)
DM.migrate(
period=(202338,202339),
target_folder_path="test",
cache_name="test_df",
queires_dir="./toolbox/dao/queries/",
predefined_query_name="pull_raw_lot_flow",
chunk_mode=True,
enforce_dtype=True
)
toolbox.dao.map_db_datatypes_to_dtype
method, map a database data type to a dtype for a Pandas DataFrame
Args:
database_type (str): engine.dialect.name, where engine is a Sqlalchemy Engine
database_driver (str): engine.driver, where engine is a Sqlalchemy Engine
db_data_type (str): str(result.cursor.description[i][1]) for the i-th column of the query result, where result is the Sqlalchemy CursorResult
datatype_mapper (dict): mapping from db data types to DataFrame dtypes
Returns:
type/str: dtype for Pandas
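A sketch of how the arguments are typically derived (the in-memory engine and query are illustrative, and whether an empty datatype_mapper is accepted is an assumption):
from sqlalchemy import create_engine, text
from toolbox.dao import map_db_datatypes_to_dtype

engine = create_engine("sqlite://")  # illustrative in-memory engine
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1 AS one"))
    dtype = map_db_datatypes_to_dtype(
        database_type=engine.dialect.name,  # e.g., "sqlite"
        database_driver=engine.driver,  # e.g., "pysqlite"
        db_data_type=str(result.cursor.description[0][1]),
        datatype_mapper={},  # illustrative; a real mapper would cover the db's types
    )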
toolbox.dao.get_column_names_and_date_types
method, get the names and data types of each column from a sqlalchemy.engine.CursorResult
Args:
result (CursorResult): sqlalchemy.engine.CursorResult
Returns:
(List, List): column names and column data types
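A sketch under the same illustrative assumptions as above:
from sqlalchemy import create_engine, text
from toolbox.dao import get_column_names_and_date_types

engine = create_engine("sqlite://")  # illustrative in-memory engine
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1 AS one, 'a' AS letter"))
    names, dtypes = get_column_names_and_date_types(result)  # two parallel lists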