6.4. Manage Data#
This section covers tools for working with your data.
6.4.1. DVC: A Data Version Control Tool for Your Data Science Projects#
!pip install dvc
While Git excels at versioning code, managing data versions can be tricky. DVC (Data Version Control) bridges this gap by allowing you to track data changes alongside your code, while keeping the actual data separate. It’s like Git for data.
Here’s a quick start guide for DVC:
# Initialize
$ dvc init
# Track the data directory
$ dvc add data  # Creates data.dvc
$ git add data.dvc
$ git commit -m "add data"
# Store the data remotely
$ dvc remote add -d remote gdrive://lynNBbT-4J0ida0eKYQqZZbC93juUUUbVH
# Push the data to remote storage
$ dvc push
# Get the data
$ dvc pull
# Switch between versions
$ git checkout HEAD^1 data.dvc
$ dvc checkout
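DVC also exposes a Python API, so a script can read a tracked file as it existed at any Git revision without checking it out first. Below is a minimal sketch using dvc.api.open; the path data/data.csv is just a placeholder for whatever file you track:

import dvc.api

# Read the version of data/data.csv recorded at the previous commit
with dvc.api.open("data/data.csv", rev="HEAD^1") as f:
    contents = f.read()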
6.4.2. sweetviz: Compare Similar Features Between Two Datasets#
!pip install sweetviz
When comparing datasets, such as training and testing sets, sweetviz helps visualize similarities and differences with ease.
Here’s how to use it:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import sweetviz as sv
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()
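sweetviz can also profile a single dataset on its own. A minimal sketch using sv.analyze (the output filename here is arbitrary):

# Profile one DataFrame instead of comparing two
report = sv.analyze(X_train)
report.show_html("train_report.html")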
6.4.3. quadratic: Data Science Spreadsheet with Python and SQL#
If you want to use Python or SQL inside a spreadsheet, use Quadratic, a data science spreadsheet that lets you write code directly in its cells.
6.4.4. whylogs: Data Logging Made Easy#
!pip install whylogs
Keeping track of dataset statistics is crucial for data quality and monitoring. whylogs makes logging dataset summaries straightforward.
Example usage:
import pandas as pd
import whylogs as why
data = {
    "Fruit": ["Apple", "Banana", "Orange"],
    "Color": ["Red", "Yellow", "Orange"],
    "Quantity": [5, 8, 3],
}
df = pd.DataFrame(data)
# Log the DataFrame using whylogs and create a profile
profile = why.log(df).profile()
# View the profile and convert it to a pandas DataFrame
prof_view = profile.view()
prof_df = prof_view.to_pandas()
prof_df
| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor | ints/max | ints/min |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Color | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | ... | [FrequentItem(value='Yellow', est=1, upper=1, ... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | NaN | NaN |
| Fruit | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | NaN | 0.000000 | NaN | ... | [FrequentItem(value='Orange', est=1, upper=1, ... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 | NaN | NaN |
| Quantity | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 8.0 | 5.333333 | 5.0 | ... | [FrequentItem(value='8', est=1, upper=1, lower... | SummaryType.COLUMN | 0 | 0 | 3 | 0 | 0 | 0 | 8.0 | 3.0 |

3 rows × 31 columns
prof_df.iloc[:, :5]
| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n |
| --- | --- | --- | --- | --- | --- |
| Color | 3.0 | 3.0 | 3.00015 | 0 | 3 |
| Fruit | 3.0 | 3.0 | 3.00015 | 0 | 3 |
| Quantity | 3.0 | 3.0 | 3.00015 | 0 | 3 |
prof_df.columns
Index(['cardinality/est', 'cardinality/lower_1', 'cardinality/upper_1',
'counts/inf', 'counts/n', 'counts/nan', 'counts/null',
'distribution/max', 'distribution/mean', 'distribution/median',
'distribution/min', 'distribution/n', 'distribution/q_01',
'distribution/q_05', 'distribution/q_10', 'distribution/q_25',
'distribution/q_75', 'distribution/q_90', 'distribution/q_95',
'distribution/q_99', 'distribution/stddev',
'frequent_items/frequent_strings', 'type', 'types/boolean',
'types/fractional', 'types/integral', 'types/object', 'types/string',
'types/tensor', 'ints/max', 'ints/min'],
dtype='object')
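whylogs profiles are mergeable, which makes it easy to log data in batches (or across workers) and combine the summaries later. A minimal sketch, assuming whylogs v1, where DatasetProfileView.merge returns a combined view:

import pandas as pd
import whylogs as why

batch1 = pd.DataFrame({"Quantity": [5, 8, 3]})
batch2 = pd.DataFrame({"Quantity": [7, 2]})

# Profile each batch separately
view1 = why.log(batch1).profile().view()
view2 = why.log(batch2).profile().view()

# Merge the two profile views into one summary covering all five rows
merged = view1.merge(view2)
merged.to_pandas()[["counts/n", "distribution/min", "distribution/max"]]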
6.4.5. Fluke: The Easiest Way to Move Data Around#
Transferring data between locations, such as from a remote server to cloud storage, can be cumbersome with general-purpose Python libraries, which often require you to manage HTTP or SSH connections and directory traversal by hand.
Fluke simplifies this process with a user-friendly API, making it easy to manage remote data transfers with just a few lines of code.
Example usage:
from fluke.auth import RemoteAuth, AWSAuth

# This object will be used to authenticate
# with the remote machine.
rmt_auth = RemoteAuth.from_password(
    hostname="host",
    username="user",
    password="password")

# This object will be used to authenticate
# with AWS.
aws_auth = AWSAuth(
    aws_access_key_id="aws_access_key",
    aws_secret_access_key="aws_secret_key")

from fluke.storage import RemoteDir, AWSS3Dir

with (
    RemoteDir(auth=rmt_auth, path='/home/user/dir') as rmt_dir,
    AWSS3Dir(auth=aws_auth, bucket="bucket", path='dir', create_if_missing=True) as aws_dir
):
    rmt_dir.transfer_to(dst=aws_dir, recursively=True)
6.4.6. safetensors: A Simple and Safe Way to Store and Distribute Tensors#
!pip install torch safetensors
PyTorch defaults to using pickle for tensor storage, which poses security risks: a malicious pickle file can execute arbitrary code when unpickled. In contrast, safetensors specializes in securely storing tensors, guaranteeing data integrity during storage and retrieval.
safetensors also uses zero-copy operations, eliminating the need to copy data into new memory locations, thereby enabling fast and efficient data handling.
import torch
from safetensors import safe_open
from safetensors.torch import save_file
tensors = {
    "weight1": torch.zeros((1024, 1024)),
    "weight2": torch.zeros((1024, 1024))
}
save_file(tensors, "model.safetensors")

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
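If you don't need the lazy, per-tensor access that safe_open provides, safetensors also ships a one-call loader. A minimal sketch using load_file:

from safetensors.torch import load_file

# Load every tensor in the file into a dict in one call
tensors = load_file("model.safetensors")
print(tensors["weight1"].shape)  # torch.Size([1024, 1024])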