6.4. Manage Data#

This section covers some tools to work with your data.

6.4.1. DVC: A Data Version Control Tool for Your Data Science Projects#

!pip install dvc

Git is a powerful tool for moving back and forth between different versions of your code. Is there a way to do the same with your data?

That is when DVC comes in handy. With DVC, you keep small metafiles describing each version of your data in Git, while the data itself is stored somewhere else.

It is essentially Git, but for data. The code below shows the basic DVC workflow.

# Initialize
$ dvc init

# Track the data directory
$ dvc add data # Creates data.dvc
$ git add data.dvc
$ git commit -m "add data"

# Store the data remotely
$ dvc remote add -d remote gdrive://lynNBbT-4J0ida0eKYQqZZbC93juUUUbVH

# Push the data to remote storage
$ dvc push 

# Get the data
$ dvc pull 

# Switch between different versions of the data
$ git checkout HEAD^1 data.dvc
$ dvc checkout
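
After dvc add, Git tracks only a small metafile while the data itself lives in the DVC cache and remote storage. For a directory, the metafile is plain YAML along these lines (the hash and sizes below are made up for illustration):

outs:
- md5: 3d1a2b4c5d6e7f8091a2b3c4d5e6f708.dir
  size: 23456789
  nfiles: 120
  path: data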

Link to DVC

Find step-by-step instructions on how to use DVC in my article.
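
DVC also has a Python API for reading tracked data directly from code. A minimal sketch, assuming a file tracked at data/train.csv and a tagged revision v1.0 (the path, repo URL, and tag are placeholders):

import dvc.api

# Open a tracked file at a specific Git revision
# (rev can be any commit, branch, or tag).
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/user/repo",
    rev="v1.0",
) as f:
    content = f.read()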

6.4.2. sweetviz: Compare the similar features between 2 different datasets#

!pip install sweetviz 

Sometimes it is important to compare features between two datasets side by side, such as a train set and a test set. If you want to quickly compare two datasets through graphs, check out sweetviz.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import sweetviz as sv

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Opening in existing browser session.

Run the code above and you will generate a report similar to this:

[Image: sweetviz report comparing train data and test data]
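
sweetviz can also analyze features against a target column. A minimal sketch, assuming you add the target back to each DataFrame first (the column name "target" and the output filename are placeholders; target_feat is the sweetviz parameter for this):

# Attach the target column so sweetviz can relate features to it
train_df = X_train.assign(target=y_train)
test_df = X_test.assign(target=y_test)

report = sv.compare(
    [train_df, "train data"],
    [test_df, "test data"],
    target_feat="target",
)
report.show_html("report_with_target.html")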

Link to sweetviz

6.4.3. Fluke: The Easiest Way to Move Data Around#

Data scientists often need to transfer data between locations, such as from a remote server to cloud storage. However, many Python libraries require a lot of boilerplate code to handle HTTP/SSH connections and to iterate over directories.

This can be cumbersome for those who simply want to move files around. Fluke offers a simple API that lets you interact with remote data in just a few lines of code.

from fluke.auth import RemoteAuth, AWSAuth

# This object will be used to authenticate
# with the remote machine.
rmt_auth = RemoteAuth.from_password(
    hostname="host",
    username="user",
    password="password")

# This object will be used to authenticate
# with AWS.
aws_auth = AWSAuth(
    aws_access_key_id="aws_access_key",
    aws_secret_access_key="aws_secret_key")

from fluke.storage import RemoteDir, AWSS3Dir

with (
    RemoteDir(auth=rmt_auth, path='/home/user/dir') as rmt_dir,
    AWSS3Dir(auth=aws_auth, bucket="bucket", path='dir', create_if_missing=True) as aws_dir
):
    rmt_dir.transfer_to(dst=aws_dir, recursively=True)
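
The same pattern should extend to other source and destination combinations. Below is a hedged sketch of uploading a local directory to S3, assuming LocalDir in fluke.storage exposes the same transfer_to interface as RemoteDir (check Fluke's documentation for the exact API):

from fluke.storage import LocalDir, AWSS3Dir

# Assumption: a local directory needs no connection handling,
# so no context manager is used for it.
local_dir = LocalDir(path="/home/user/data")

with AWSS3Dir(
    auth=aws_auth, bucket="bucket",
    path="backup", create_if_missing=True
) as aws_dir:
    local_dir.transfer_to(dst=aws_dir, recursively=True)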

Link to Fluke.