toolbox.dao.files
As the core data structure in the field of data analytics, the DataFrame is widely supported by many software packages, especially in the Python/R open-source communities. Like a table in a conventional database or a spreadsheet, a DataFrame organizes data into a 2-dimensional table of rows and columns.
The class Loader provides the basic functionality for loading data from these essential data sources/types and storing the data in DataFrame(s) organized in a Python dictionary. If the data source is a set of spreadsheets, the dictionary is composed of key-value pairs that use spreadsheet names as keys and DataFrames as values.
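As a minimal illustration of that structure (with hypothetical names and data), the returned object is an ordinary dictionary mapping names to pandas DataFrames:

```python
import pandas as pd

# Hypothetical data: two sources stored as key-value pairs,
# name -> DataFrame, the same shape a Loader returns.
frames = {
    "eng_hours": pd.DataFrame({"Area": ["PHOTO"], "engg_hours": [252.0]}),
    "projection": pd.DataFrame({"Week": [202327], "Qty": [10]}),
}
for name, df in frames.items():
    print(name, df.shape)  # eng_hours (1, 2) / projection (1, 2)
```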
toolbox.dao.files.FileLoader
class
Args:
path (str): a path to a file or directory
archive (bool, optional): if an archive file is the target of the path. Defaults to False.
The argument path
can point to a file or a folder; if the file is an archive, the optional argument archive
should be passed as True.
Once the instance has been initialized, call the method execute
to load the data and return the results as a dictionary.
Load source data into a dictionary. path can only point to a file or a directory (which may contain multiple files of the same type).
Supported file types:
- comma-separated values (.csv)
- Excel (.xlsx, .xlsb)
- Sqlite (.db, .sqlite)
- JSON (.json)
- Apache Parquet (.parquet)
- Python pickle file (.pkl)
For loading multiple files, only csv, pkl, and parquet are supported.
path can also point to an archive of file(s) or of a folder; the supported compression formats are:
- tar.gz
- tar.bz2
- tar.xz
- zip
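For illustration, such an archive can be produced with the standard library tarfile module. This is a sketch of input preparation (with hypothetical file names), not part of FileLoader itself:

```python
import tarfile
import tempfile
from pathlib import Path

# Build a dev.tar.gz containing one csv, ready for
# FileLoader(path, archive=True) to consume.
src = Path(tempfile.mkdtemp())
(src / "eng_hours.csv").write_text("Area,engg_hours\nPHOTO,252.0\n")
archive_path = src / "dev.tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(src / "eng_hours.csv", arcname="eng_hours.csv")
print(archive_path.name)  # dev.tar.gz
```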
toolbox.dao.files.FileLoader.execute
method, load file(s)
Accepts kwargs for the following functions: pandas.read_csv, pandas.read_excel, pandas.read_pickle, pandas.read_parquet, json.load
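The keyword arguments are forwarded to the underlying reader; for instance, a sep='|' given to execute would reach pandas.read_csv, whose behavior is shown here directly (in-memory data for illustration):

```python
import io
import pandas as pd

# The same sep='|' passed to FileLoader.execute(sep='|') ends up here.
csv_text = "Area|engg_hours\nPHOTO|252.0\n"
df = pd.read_csv(io.StringIO(csv_text), sep="|")
print(df.columns.tolist())  # ['Area', 'engg_hours']
```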
toolbox.dao.files.FileLoader.df_dtype_refine
method, refine all pandas DataFrames by assigning new data types that use less memory. Based on the contents, the data types of numerical columns are downcast, and the PyArrow string dtype is assigned to string columns.
Args:
inplace (bool, optional): whether to modify the original data. Defaults to False.
ignored_dfs (list, optional): list of keys that will be ignored by this method.
Returns:
object: a copy of the modified instance
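The refinement can be sketched with plain pandas. This is an approximation of the assumed behavior, not the library's exact implementation:

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401
    STRING_DTYPE = "string[pyarrow]"
except ImportError:  # fall back if pyarrow is unavailable
    STRING_DTYPE = "string"

def refine_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns and use a string dtype for text columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]):
            out[col] = out[col].astype(STRING_DTYPE)
    return out

df = pd.DataFrame(
    {"ww": [202327, 202328], "hours": [252.0, 28.57], "area": ["PHOTO", "DRY ETCH"]}
)
refined = refine_dtypes(df)
print(refined.dtypes)  # ww: int32, hours: float32, area: string
```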
"""
toolbox.dao.files.FileLoader.size_summary
property, the size of each item; data storage units are self-adjusted
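The self-adjusting units can be illustrated with a small helper (hypothetical formatting; the property's exact output may differ):

```python
import pandas as pd

def human_size(num_bytes: float) -> str:
    """Pick the largest unit that keeps the value below 1024."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} TB"

df = pd.DataFrame({"a": range(1000)})
print(human_size(df.memory_usage(deep=True).sum()))
```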
Example of Loader
from toolbox.dao.files import FileLoader
# for single file
path = "./projects/test/dev.csv"
FL = FileLoader(path, archive=False)
FL.execute()
## assign a particular separator
# FL.execute(sep='|')
# for multiple files (assume a number of csv are under directory test)
path = "./projects/test"
FL = FileLoader(path, archive=False)
FL.execute()
# for archive
path = "./projects/test/dev.tar.gz"
FL = FileLoader(path, archive=True)
FL.execute()
print(FL)
print(FL.eng_hours)  # alternative: FL["eng_hours"]
{
"avaliable keys": [
"projection_2427",
"eng_hours",
"workstation_2427",
"flowAdjust_2427_HAZ",
"conversion_2427_HAZ",
"EMSAvailableTools_2427"
],
"input_path": "/home/lhsieh/projects/../test/dev.tar.gz",
"archive": false
}
Area WS_name start_ww_num engg_hours
0 PHOTO VENDOR_6I_CD 202327 252.00
1 DRY ETCH VENDOR_ADVTG_MET 202327 28.57
2 DIFFUSION VENDOR_CENT_DPNRTP 202327 28.38
3 DIFFUSION VENDOR_CENT_RPO 202327 65.68
4 DIFFUSION VENDOR_CENT_RPO2 202327 79.75
.. ... ... ... ...
174 PVD VENDOR_ENT_HM 202327 59.00
175 IMPLANT VENDOR_PLAD_HC_B2H 202327 17.47
176 IMPLANT VENDOR_PLAD_HC_BF3 202327 19.65
177 IMPLANT VENDOR_VST900P_MC 202327 120.96
178 IMPLANT VENDOR_VSTTRDXP_HC 202327 260.04
[179 rows x 4 columns]
A files.FileLoader
instance can be passed to toolbox.dao.Feed
, which exposes the loaded data as its attribute DF
example
from toolbox.dao import Feed
from toolbox.dao.files import FileLoader
path = "./projects/test/dev.tar.gz"
FL = FileLoader(path, archive=True)
FL.execute()
model_input = Feed(dict_DF=FL)
print(model_input)
# access DataFrame eng_hours using model_input.DF.eng_hours
data attributes:
{
"DF": [
"projection_2427",
"eng_hours",
"workstation_2427",
"flowAdjust_2427_HAZ",
"conversion_2427_HAZ",
"EMSAvailableTools_2427",
],
"MAP": [],
"PAR": []
}
print(model_input.DF.eng_hours)
Reference
JSON example
An example of the JSON format accepted by Loader:
{
    "Dataset_1": [
        {
            "First_name": "Liam",
            "Last_name": "Hsieh",
            "Weight": 140,
            "Espanol": false
        },
        {
            "First_name": "Milly",
            "Last_name": "Hsieh",
            "Weight": 45,
            "Espanol": false
        }
    ],
    "Dataset_2": [
        {
            "Week": 202235,
            "Math": "Tue-4",
            "English": "Wed-2",
            "Recess": "Thu-6"
        },
        {
            "Week": 202236,
            "Math": "Tue-3",
            "English": "Fri-1",
            "Recess": "Wed-5"
        }
    ]
}
Each JSON array is loaded as a pandas DataFrame.
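For example, loading Dataset_1 above with json.load yields a list of records that pandas converts directly, one row per JSON object (a sketch of the assumed behavior):

```python
import io
import json
import pandas as pd

json_text = """
{"Dataset_1": [
    {"First_name": "Liam", "Last_name": "Hsieh", "Weight": 140, "Espanol": false},
    {"First_name": "Milly", "Last_name": "Hsieh", "Weight": 45, "Espanol": false}
]}
"""
data = json.load(io.StringIO(json_text))
df = pd.DataFrame(data["Dataset_1"])  # each object becomes one row
print(df.shape)  # (2, 4)
```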