6.3. Get Data

This section covers tools for getting data for your projects.

6.3.1. faker: Create Fake Data in One Line of Code

!pip install Faker

To quickly create fake data for testing, use faker.

from faker import Faker

fake = Faker()

fake.color_name()
'CornflowerBlue'
fake.name()
'Michael Scott'
fake.address()
'881 Patricia Crossing\nSouth Jeremy, AR 06087'
fake.date_of_birth(minimum_age=22)
datetime.date(1927, 11, 5)
fake.city()
'North Donald'
fake.job()
'Teacher, secondary school'

Link to faker.

Link to my full article on faker.

6.3.2. Random User: Generate Random User Data in One Line of Code

Have you ever wanted to create fake user data for testing? Random User Generator is a free API that generates random user data. Below is how to download and use this data in your code.

import json
from urllib.request import urlopen

# Show 2 random users
data = urlopen("https://randomuser.me/api?results=2").read()
users = json.loads(data)["results"]
users
[{'gender': 'female',
  'name': {'title': 'Miss', 'first': 'Ava', 'last': 'Hansen'},
  'location': {'street': {'number': 3526, 'name': 'George Street'},
   'city': 'Worcester',
   'state': 'Merseyside',
   'country': 'United Kingdom',
   'postcode': 'K7Z 3WB',
   'coordinates': {'latitude': '11.9627', 'longitude': '17.6871'},
   'timezone': {'offset': '+9:00',
    'description': 'Tokyo, Seoul, Osaka, Sapporo, Yakutsk'}},
  'email': 'ava.hansen@example.com',
  'login': {'uuid': '253e53f9-9553-4345-9047-fb18aec51cfe',
   'username': 'heavywolf743',
   'password': 'cristina',
   'salt': 'xwnpqwtd',
   'md5': '2b5037da7d78258f167d5a3f8dc24edb',
   'sha1': 'fabbede0577b3fed686afd319d5ab794f1b35b02',
   'sha256': 'd42e2061f9c283c4548af6c617727215c79ecafc74b9f3a294e6cf09afc5906f'},
  'dob': {'date': '1948-01-21T10:26:00.053Z', 'age': 73},
  'registered': {'date': '2011-11-19T03:28:46.830Z', 'age': 10},
  'phone': '015242 07811',
  'cell': '0700-326-155',
  'id': {'name': 'NINO', 'value': 'HT 97 25 71 Y'},
  'picture': {'large': 'https://randomuser.me/api/portraits/women/60.jpg',
   'medium': 'https://randomuser.me/api/portraits/med/women/60.jpg',
   'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/60.jpg'},
  'nat': 'GB'},
 {'gender': 'male',
  'name': {'title': 'Mr', 'first': 'Aubin', 'last': 'Martin'},
  'location': {'street': {'number': 8496, 'name': "Rue du Bât-D'Argent"},
   'city': 'Strasbourg',
   'state': 'Meurthe-et-Moselle',
   'country': 'France',
   'postcode': 83374,
   'coordinates': {'latitude': '-1.3192', 'longitude': '24.0062'},
   'timezone': {'offset': '+10:00',
    'description': 'Eastern Australia, Guam, Vladivostok'}},
  'email': 'aubin.martin@example.com',
  'login': {'uuid': '54b9bfa9-5e86-4335-8ae3-164d85df98e7',
   'username': 'heavyladybug837',
   'password': 'kendra',
   'salt': 'LcEMyR5s',
   'md5': '2fbd9e05d992eb74f7afcccec02581fc',
   'sha1': '530a1bc71a986415176606ea377961d2ce381e5d',
   'sha256': 'f5ee7bc47f5615e89f1729dcb49632c6b76a90ba50eb42d782e2790398ebc539'},
  'dob': {'date': '1949-04-12T05:01:31.463Z', 'age': 72},
  'registered': {'date': '2006-05-28T03:54:36.433Z', 'age': 15},
  'phone': '01-88-32-00-30',
  'cell': '06-09-79-55-81',
  'id': {'name': 'INSEE', 'value': '1NNaN48231023 75'},
  'picture': {'large': 'https://randomuser.me/api/portraits/men/65.jpg',
   'medium': 'https://randomuser.me/api/portraits/med/men/65.jpg',
   'thumbnail': 'https://randomuser.me/api/portraits/thumb/men/65.jpg'},
  'nat': 'FR'}]
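
The nested records above are awkward to load into a table. As a minimal sketch, the code below builds the request URL from keyword arguments and flattens one record into a single-level dictionary. The helper names `build_url` and `flatten_user` are hypothetical and not part of Random User Generator; `results` and `nat` are query parameters of the API, and the sample record is an abbreviated copy of the first user shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://randomuser.me/api"


def build_url(**params):
    """Return the API URL with the given query parameters appended."""
    return f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL


def flatten_user(user, parent_key="", sep="_"):
    """Flatten nested dicts: {'name': {'first': 'Ava'}} -> {'name_first': 'Ava'}."""
    flat = {}
    for key, value in user.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along
            flat.update(flatten_user(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Abbreviated copy of the first record in the output above
sample_user = {
    "gender": "female",
    "name": {"title": "Miss", "first": "Ava", "last": "Hansen"},
    "location": {
        "street": {"number": 3526, "name": "George Street"},
        "city": "Worcester",
        "country": "United Kingdom",
    },
    "email": "ava.hansen@example.com",
    "nat": "GB",
}

url = build_url(results=2, nat="gb")
flat_user = flatten_user(sample_user)
print(url)
print(flat_user["name_first"], flat_user["location_street_name"])
```

A list of flattened records like this can be passed directly to `pandas.DataFrame` if you want one row per user.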

Link to Random User Generator.

6.3.3. fetch_openml: Get OpenML’s Dataset in One Line of Code

OpenML hosts many interesting datasets. The easiest way to load one in Python is the sklearn.datasets.fetch_openml function.

In one line of code, you get an OpenML dataset to play with!

from sklearn.datasets import fetch_openml

monk = fetch_openml(name="monks-problems-2", as_frame=True)
print(monk["data"].head(10))
  attr1 attr2 attr3 attr4 attr5 attr6
0     1     1     1     1     2     2
1     1     1     1     1     4     1
2     1     1     1     2     1     1
3     1     1     1     2     1     2
4     1     1     1     2     2     1
5     1     1     1     2     3     1
6     1     1     1     2     4     1
7     1     1     1     3     2     1
8     1     1     1     3     4     1
9     1     1     2     1     1     1

6.3.4. Autoscraper

!pip install autoscraper

If you want to extract data from websites, Beautiful Soup makes it easy. But can scraping be automated even more? If you are looking for a faster way to scrape complicated websites such as Stack Overflow or GitHub in a few lines of code, try autoscraper.

All you need to do is give it some sample text so it can learn the scraping rule, and it will take care of the rest for you!

from autoscraper import AutoScraper

url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"

wanted_list = ["How to check version of python modules?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)

for res in result:
    print(res)
How to execute a program or call a system command?
What are metaclasses in Python?
Does Python have a ternary conditional operator?
Convert bytes to a string
Does Python have a string 'contains' substring method?
How to check version of python modules?

Link to autoscraper.

6.3.5. pandas-datareader: Extract Data from Various Internet Sources Directly into a Pandas DataFrame

!pip install pandas-datareader

Have you ever wanted to extract time series data from various Internet sources directly into a pandas DataFrame? That is when pandas-datareader comes in handy.

The snippet below extracts daily data for AD from 2008 to 2018 through the Alpha Vantage source ("av-daily").

import os
from datetime import datetime
import pandas_datareader.data as web

df = web.DataReader(
    "AD",
    "av-daily",
    start=datetime(2008, 1, 1),
    end=datetime(2018, 2, 28),
    api_key=os.getenv("ALPHAVANTAGE_API_KEY"),
)

Link to pandas-datareader.

6.3.6. pytrends: Get the Trend of a Keyword on Google Search Over Time

!pip install pytrends

If you want to get the trend of a keyword on Google Search over time, try pytrends.

In the code below, I use pytrends to get the interest of the keyword “data science” on Google Search from 2016 to 2021.

from pytrends.request import TrendReq
pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(kw_list=["data science"])

df = pytrends.interest_over_time()
df["data science"].plot(figsize=(20, 7))
[Figure: line plot of Google Search interest in "data science" over time]
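
The call to build_payload above does not set a date range, so pytrends falls back to its default window (the past five years, I believe). To pin the window to fixed dates such as 2016 to 2021, build_payload accepts a `timeframe` string of the form `YYYY-MM-DD YYYY-MM-DD`. Below is a minimal sketch under that assumption; `make_timeframe` and `fetch_interest` are hypothetical helper names, and the network call is left commented out.

```python
from datetime import date


def make_timeframe(start, end):
    """Format two dates as a 'YYYY-MM-DD YYYY-MM-DD' timeframe string."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{start.isoformat()} {end.isoformat()}"


def fetch_interest(keyword, start, end):
    """Return search interest over time for one keyword (needs network access)."""
    # Imported lazily so make_timeframe can be used without pytrends installed
    from pytrends.request import TrendReq

    pytrends = TrendReq(hl="en-US", tz=360)
    pytrends.build_payload(kw_list=[keyword], timeframe=make_timeframe(start, end))
    return pytrends.interest_over_time()


timeframe = make_timeframe(date(2016, 1, 1), date(2021, 1, 1))
print(timeframe)

# Uncomment to query Google Trends:
# df = fetch_interest("data science", date(2016, 1, 1), date(2021, 1, 1))
```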

Link to pytrends.

6.3.7. snscrape: Scrape Social Networking Services in Python

If you want to scrape social networking services such as Twitter, Facebook, Reddit, etc., try snscrape.

For example, you can use snscrape to scrape all tweets from a user or get the latest 100 tweets with the hashtag #python.

# Scrape all tweets from @KhuyenTran16
snscrape twitter-user KhuyenTran16

# Save outputs
snscrape twitter-user KhuyenTran16 >> khuyen_tweets 

# Scrape 100 tweets with hashtag python
snscrape --max-results 100 twitter-hashtag python

Link to snscrape.

6.3.8. Datacommons: Get Statistics about a Location in One Line of Code

!pip install datacommons

If you want to get some interesting statistics about a location in one line of code, try Datacommons. Datacommons provides publicly available data from open sources (census.gov, cdc.gov, data.gov, etc.). Below are some statistics extracted from Datacommons.

import datacommons_pandas
import plotly.express as px 
import pandas as pd 

6.3.8.1. Find the Median Income in California Over Time

median_income = datacommons_pandas.build_time_series("geoId/06", "Median_Income_Person")
median_income.index = pd.to_datetime(median_income.index)
median_income.plot(
    figsize=(20, 10),
    xlabel="Year",
    ylabel="Income",
    title="Median Income in California Over Time",
)
[Figure: line plot of median income in California over time]

6.3.8.2. Number of People in the U.S. Over Time

def process_ts(statistics: str):
    """Get a U.S. time series for the given statistic, indexed by date."""
    count_person = datacommons_pandas.build_time_series("country/USA", statistics)
    count_person.index = pd.to_datetime(count_person.index)
    count_person.name = statistics
    return count_person


count_person_male = process_ts("Count_Person_Male")
count_person_female = process_ts("Count_Person_Female")
count_person = pd.concat([count_person_female, count_person_male], axis=1)

count_person.plot(
    figsize=(20, 10),
    title="Number of People in the U.S. Over Time",
)
[Figure: line plot of the number of females and males in the U.S. over time]

6.3.8.3. Number of Robberies in the US Over Time

count_robbery = datacommons_pandas.build_time_series(
    "country/USA", "Count_CriminalActivities_Robbery"
)
count_robbery.index = pd.to_datetime(count_robbery.index)
count_robbery.plot(
    figsize=(20, 10),
    title="Number of Robberies in the US Over Time",
)
[Figure: line plot of the number of robberies in the US over time]

Link to Datacommons.

6.3.9. Get Google News Using Python

!pip install GoogleNews

If you want to get Google News results in Python, use GoogleNews. GoogleNews allows you to get search results for a keyword in a specific time interval.

from GoogleNews import GoogleNews
googlenews = GoogleNews()
googlenews.set_time_range('02/01/2022','03/25/2022')
googlenews.search('funny')
googlenews.results()
[{'title': 'Hagan has fastest NHRA Funny Car run in 4 years',
  'media': 'ESPN',
  'date': 'Feb 26, 2022',
  'datetime': datetime.datetime(2022, 2, 26, 0, 0),
  'desc': '-- Matt Hagan made the quickest Funny Car run in four years Saturday, \ngiving the new Tony Stewart Racing NHRA team its first No. 1 qualifier and \nsetting the...',
  'link': 'https://www.espn.com/racing/story/_/id/33381149/matt-hagan-fastest-nhra-funny-car-pass-4-years',
  'img': ''},
 {'title': 'Full fields in Top Fuel, Funny Car, and Pro Stock promise fast ...',
  'media': 'NHRA',
  'date': 'Feb 10, 2022',
  'datetime': datetime.datetime(2022, 2, 10, 0, 0),
  'desc': 'The pits at Auto Club Raceway at Pomona will be packed with NHRA Camping \nWorld Drag Racing Series teams for the 2022 season-opening Lucas Oil NHRA...',
  'link': 'https://www.nhra.com/news/2022/full-fields-top-fuel-funny-car-and-pro-stock-promise-fast-start-winternationals',
  'img': ''},
 {'title': 'Full Cast Set for Broadway Revival of Funny Girl, Starring ...',
  'media': 'Playbill',
  'date': 'Feb 7, 2022',
  'datetime': datetime.datetime(2022, 2, 7, 0, 0),
  'desc': 'Among those newly added to the company are Peter Francis James, Ephie \nAardema, Martin Moran, and Julie Benko. By Margaret Hall. February 07, 2022.',
  'link': 'https://playbill.com/article/full-cast-set-for-broadway-revival-of-funny-girl-starring-beanie-feldstein-and-ramin-karimloo',
  'img': ''},
 {'title': 'Robert Hight tops Funny Car qualifying at season-opening Lucas Oil NHRA Winternationals',
  'media': 'ESPN',
  'date': 'Feb 18, 2022',
  'datetime': datetime.datetime(2022, 2, 18, 0, 0),
  'desc': "-- Robert Hight topped Funny Car qualifying Friday night in the NHRA \nCamping World Drag Racing Series' season-opening Lucas Oil NHRA \nWinternationals. Hight, a...",
  'link': 'https://www.espn.com/racing/story/_/id/33324340/robert-hight-tops-funny-car-qualifying-season-opening-lucas-oil-nhra-winternationals',
  'img': ''},
 {'title': 'New NHRA Funny Car Team Owner Ron Capps Throws ...',
  'media': 'Autoweek',
  'date': 'Feb 21, 2022',
  'datetime': datetime.datetime(2022, 2, 21, 0, 0),
  'desc': 'Defending Funny Car champion enters season without automaker deal after \nlong-time partner turns him down. By Susan Wade. Feb 21, 2022.',
  'link': 'https://www.autoweek.com/racing/nhra/a39160639/ron-capps-throws-dodgemopar-under-bus/',
  'img': ''},
 {'title': 'VIDEO: Beanie Feldstein, Ramin Karimloo, and More in ...',
  'media': 'Broadway World',
  'date': 'ar 9, 2022',
  'datetime': None,
  'desc': 'The highly anticipated Broadway revival of Funny Girl is beginning \nperformances this month! The musical will have its first preview at the \nAugust Wilson on...',
  'link': 'https://www.broadwayworld.com/article/VIDEO-Beanie-Feldstein-Ramin-Karimloo-and-More-in-Rehearsal-For-FUNNY-GIRL-20220309',
  'img': ''},
 {'title': 'Watch: The Funny Girl Sitzprobe, With Beanie Feldstein and ...',
  'media': 'TheaterMania',
  'date': 'ar 24, 2022',
  'datetime': None,
  'desc': "Funny Girl is headed back to Broadway. Here is a first look at the cast's \nfirst orchestra rehearsal, with snippets of stars Beanie Feldstein, Ramin \nKarimloo...",
  'link': 'https://www.theatermania.com/broadway/news/first-look-the-funny-girl-sitzprobe-with-beanie-fe_93550.html',
  'img': ''},
 {'title': 'Stephen Colbert, Funny or Die Prep Primetime Pickleball Special for CBS',
  'media': 'The Hollywood Reporter',
  'date': 'ar 15, 2022',
  'datetime': None,
  'desc': 'Stephen Colbert, Funny or Die Prep Primetime Pickleball Special for CBS. \nThe special, \'Pickled,\' will see celebrity competitors vie for the "Golden \nGherkin.".',
  'link': 'https://www.hollywoodreporter.com/tv/tv-news/stephen-colbert-funny-or-die-primetime-pickleball-cbs-1235111617/',
  'img': ''},
 {'title': 'Randy Meyer Racing to debut injected nitro Funny Car at ...',
  'media': 'NHRA',
  'date': 'ar 22, 2022',
  'datetime': None,
  'desc': "The Funny Car Chaos deal is becoming more popular here in the Midwest, so \nit's an opportunity for us to go race close to home, have some fun, and \ntake on a new...",
  'link': 'https://www.nhra.com/news/2022/randy-meyer-racing-debut-injected-nitro-funny-car-funny-car-chaos-event',
  'img': ''},
 {'title': 'Laurie Zaleski talks about her book “Funny Farm”',
  'media': 'The Washington Post',
  'date': 'Feb 25, 2022',
  'datetime': datetime.datetime(2022, 2, 25, 0, 0),
  'desc': "This is the Funny Farm, double-entendre intended: “Because it's full of \nanimals, and fit for lunatics,” Zaleski jokes of the sanctuary that she \nbuilt here,...",
  'link': 'https://www.washingtonpost.com/books/2022/02/25/funny-farm-rescue-animals/',
  'img': ''}]
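
Some results above have 'datetime': None, apparently because the date string could not be parsed. Below is a minimal sketch of ordering results while tolerating those entries: it sorts by 'datetime', newest first, and places undated entries at the end. `sort_by_date` is a hypothetical helper, not part of GoogleNews, and the sample list is an abbreviated copy of the output above.

```python
import datetime


def sort_by_date(results):
    """Sort news results newest first; entries whose 'datetime' is None go last."""
    dated = [r for r in results if r.get("datetime") is not None]
    undated = [r for r in results if r.get("datetime") is None]
    return sorted(dated, key=lambda r: r["datetime"], reverse=True) + undated


# Abbreviated copy of three results from the output above
sample_results = [
    {
        "title": "Full Cast Set for Broadway Revival of Funny Girl, Starring ...",
        "media": "Playbill",
        "datetime": datetime.datetime(2022, 2, 7, 0, 0),
    },
    {
        "title": "Randy Meyer Racing to debut injected nitro Funny Car at ...",
        "media": "NHRA",
        "datetime": None,
    },
    {
        "title": "Hagan has fastest NHRA Funny Car run in 4 years",
        "media": "ESPN",
        "datetime": datetime.datetime(2022, 2, 26, 0, 0),
    },
]

sorted_results = sort_by_date(sample_results)
for item in sorted_results:
    print(item["datetime"], item["media"])
```

With real data, you would pass googlenews.results() to the helper instead of the sample list.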

Link to GoogleNews.