A Data-Driven Approach To Cryptocurrency Speculation

How do Bitcoin markets behave? What are the causes of the sudden spikes and dips in cryptocurrency values? Are the markets for different altcoins, such as Litecoin and Ripple, inseparably linked or largely independent? How can we predict what will happen next?

Articles on cryptocurrencies, such as Bitcoin and Ethereum, are rife with speculation these days, with hundreds of self-proclaimed experts advocating for the trends that they expect to emerge. What is lacking from many of these analyses is a strong data analysis foundation to back up the claims.

The goal of this article is to provide an easy introduction to cryptocurrency analysis using Python. We will walk through a simple Python script to retrieve, analyze, and visualize data on different cryptocurrencies. In the process, we will uncover an interesting trend in how these volatile markets behave, and how they are evolving.

Combined Altcoin Prices

This is not a post explaining what cryptocurrencies are (if you want one, I would recommend this great overview), nor is it an opinion piece on which specific currencies will rise and which will fall. Instead, all that we are concerned about in this tutorial is procuring the raw data and uncovering the stories hidden in the numbers.

Step 1 - Setup Your Data Laboratory

The tutorial is intended to be accessible for enthusiasts, engineers, and data scientists at all skill levels. The only skills that you will need are a basic understanding of Python and enough knowledge of the command line to set up a project.

Step 1.1 - Install Anaconda

The easiest way to install the dependencies for this project from scratch is to use Anaconda, a prepackaged Python data science ecosystem and dependency manager.

To set up Anaconda, I would recommend following the official installation instructions - https://www.continuum.io/downloads.

If you're an advanced user, and you don't want to use Anaconda, that's totally fine; I'll assume you don't need help installing the required dependencies. Feel free to skip to section 2.

Step 1.2 - Setup an Anaconda Project Environment

Once Anaconda is installed, we'll want to create a new environment to keep our dependencies organized.

Run conda create --name cryptocurrency-analysis python=3 to create a new Anaconda environment for our project.

Next, run source activate cryptocurrency-analysis (on Linux/macOS) or activate cryptocurrency-analysis (on Windows) to activate this environment.

Finally, run conda install numpy pandas nb_conda jupyter plotly quandl to install the required dependencies in the environment. This could take a few minutes to complete.

Why use environments? If you plan on developing multiple Python projects on your computer, it is helpful to keep the dependencies (software libraries and packages) separate in order to avoid conflicts. Anaconda will create a special environment directory for the dependencies for each project to keep everything organized and separated.

Step 1.3 - Start An Interactive Jupyter Notebook

Once the environment and dependencies are all set up, run jupyter notebook to start the iPython kernel, and open your browser to http://localhost:8888/. Create a new Python notebook, making sure to use the Python [conda env:cryptocurrency-analysis] kernel.

Empty Jupyter Notebook

Step 1.4 - Import the Dependencies At The Top of The Notebook

Once you've got a blank Jupyter notebook open, the first thing we'll do is import the required dependencies.

In [1]:
import os
import numpy as np
import pandas as pd
import pickle
import quandl
from datetime import datetime

We'll also import Plotly and enable the offline mode.

In [2]:
import plotly.offline as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)

In [3]:
quandl.ApiConfig.api_key = os.environ['QUANDL_API_KEY']

Step 2 - Retrieve Bitcoin Pricing Data

Now that everything is set up, we're ready to start retrieving data for analysis. First, we need to get Bitcoin pricing data using Quandl's free Bitcoin API.

Step 2.1 - Define Quandl Helper Function

To assist with this data retrieval we'll define a function to download and cache datasets from Quandl.

In [4]:
def get_quandl_data(quandl_id):
    '''Download and cache Quandl dataseries'''
    cache_path = '{}.pkl'.format(quandl_id).replace('/','-')
    try:
        f = open(cache_path, 'rb')
        df = pickle.load(f)
        print('Loaded {} from cache'.format(quandl_id))
    except (OSError, IOError) as e:
        print('Downloading {} from Quandl'.format(quandl_id))
        df = quandl.get(quandl_id, returns="pandas")
        df.to_pickle(cache_path)
        print('Cached {} at {}'.format(quandl_id, cache_path))
    return df

We're using pickle to serialize and save the downloaded data as a file, which will prevent our script from re-downloading the same data each time we run the script. The function will return the data as a Pandas dataframe. If you're not familiar with dataframes, you can think of them as super-powered Python spreadsheets.
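If you'd like to see what this caching pattern does in isolation, here's a minimal, self-contained sketch using a toy dataframe and a hypothetical cache file name (example-cache.pkl):

```python
import pickle
import pandas as pd

# Toy dataframe standing in for a downloaded Quandl dataseries
df = pd.DataFrame({'Weighted Price': [841.8, 839.2]},
                  index=pd.to_datetime(['2014-01-07', '2014-01-08']))

# Serialize the dataframe to disk (the hypothetical cache file)
with open('example-cache.pkl', 'wb') as f:
    pickle.dump(df, f)

# Load it back; the round-tripped dataframe is identical to the original
with open('example-cache.pkl', 'rb') as f:
    cached = pickle.load(f)
```

On subsequent runs, the open() call succeeds and the cached copy is used, so each dataset is downloaded at most once.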

Step 2.2 - Pull Kraken Exchange Pricing Data

Let's first pull the historical Bitcoin exchange rate for the Kraken Bitcoin exchange.

In [5]:
# Pull Kraken BTC price exchange data
btc_usd_price_kraken = get_quandl_data('BCHARTS/KRAKENUSD')
Loaded BCHARTS/KRAKENUSD from cache

We can inspect the first 5 rows of the dataframe using the head() method.

In [6]:
btc_usd_price_kraken.head()
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
2014-01-07 874.67040 892.06753 810.00000 810.00000 15.622378 13151.472844 841.835522
2014-01-08 810.00000 899.84281 788.00000 824.98287 19.182756 16097.329584 839.156269
2014-01-09 825.56345 870.00000 807.42084 841.86934 8.158335 6784.249982 831.572913
2014-01-10 839.99000 857.34056 817.00000 857.33056 8.024510 6780.220188 844.938794
2014-01-11 858.20000 918.05471 857.16554 899.84105 18.748285 16698.566929 890.671709

Next, we'll generate a simple chart as a quick visual verification that the data looks correct.

In [7]:
# Chart the BTC pricing data
btc_trace = go.Scatter(x=btc_usd_price_kraken.index, y=btc_usd_price_kraken['Weighted Price'])
py.iplot([btc_trace])

Here, we're using Plotly for generating our visualizations. This is a less traditional choice than some of the more established Python data visualization libraries such as Matplotlib, but I think Plotly is a great choice since it produces fully-interactive charts using D3.js. These charts have attractive visual defaults, are easy to explore, and are very simple to embed in web pages.

As a quick sanity check, you should compare the generated chart with publicly available graphs of Bitcoin prices (such as those on Coinbase) to verify that the downloaded data is legit.

Step 2.3 - Pull Pricing Data From More BTC Exchanges

You might have noticed a hitch in this dataset - there are a few notable down-spikes, particularly in late 2014 and early 2016. These spikes are specific to the Kraken dataset, and we obviously don't want them to be reflected in our overall pricing analysis.

The nature of Bitcoin exchanges is that the pricing is determined by supply and demand, hence no single exchange contains a true "master price" of Bitcoin. To solve this issue, along with that of down-spikes, we'll pull data from three more major Bitcoin exchanges to calculate an aggregate Bitcoin price index.

First, we will download the data from each exchange into a dictionary of dataframes.

In [8]:
# Pull pricing data for 3 more BTC exchanges
exchanges = ['COINBASE','BITSTAMP','ITBIT']

exchange_data = {}

exchange_data['KRAKEN'] = btc_usd_price_kraken

for exchange in exchanges:
    exchange_code = 'BCHARTS/{}USD'.format(exchange)
    btc_exchange_df = get_quandl_data(exchange_code)
    exchange_data[exchange] = btc_exchange_df
Loaded BCHARTS/COINBASEUSD from cache
Loaded BCHARTS/BITSTAMPUSD from cache
Loaded BCHARTS/ITBITUSD from cache

Step 2.4 - Merge All Of The Pricing Data Into A Single Dataframe

Next, we will define a simple function to merge a common column of each dataframe into a new combined dataframe.

In [9]:
def merge_dfs_on_column(dataframes, labels, col):
    '''Merge a single column of each dataframe into a new combined dataframe'''
    series_dict = {}
    for index in range(len(dataframes)):
        series_dict[labels[index]] = dataframes[index][col]
    return pd.DataFrame(series_dict)
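To make the behavior concrete, here's a toy example with two made-up dataframes (the function is repeated so the snippet stands alone):

```python
import pandas as pd

def merge_dfs_on_column(dataframes, labels, col):
    '''Merge a single column of each dataframe into a new combined dataframe'''
    series_dict = {}
    for index in range(len(dataframes)):
        series_dict[labels[index]] = dataframes[index][col]
    return pd.DataFrame(series_dict)

dates = pd.to_datetime(['2017-01-01', '2017-01-02'])

# Two toy "exchange" dataframes sharing a date index
df_a = pd.DataFrame({'Weighted Price': [100.0, 101.0]}, index=dates)
df_b = pd.DataFrame({'Weighted Price': [99.0, 102.0]}, index=dates)

# The result has one column per label, aligned on the shared index
merged = merge_dfs_on_column([df_a, df_b], ['A', 'B'], 'Weighted Price')
```

Because each series keeps its index, dates missing from one exchange simply become NaN in that column rather than shifting the rows out of alignment.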

Now we will merge all of the dataframes together on their "Weighted Price" column.

In [10]:
# Merge the BTC price dataseries' into a single dataframe
btc_usd_datasets = merge_dfs_on_column(list(exchange_data.values()), list(exchange_data.keys()), 'Weighted Price')

Finally, we can preview the last five rows of the result using the tail() method, to make sure it looks ok.

In [11]:
btc_usd_datasets.tail()
BITSTAMP COINBASE ITBIT KRAKEN
2017-08-17 4338.694675 4334.115210 4334.449440 4346.508031
2017-08-18 4180.171091 4167.053043 4174.715155 4195.697579
2017-08-19 4030.604133 4096.284462 4052.981179 4121.371679
2017-08-20 4054.143713 4105.412784 4099.880702 4114.258059
2017-08-21 4009.725428 4023.056275 4008.411018 4047.388345

Step 2.5 - Visualize The Pricing Datasets

The next logical step is to visualize how these pricing datasets compare. For this, we'll define a helper function to provide a single-line command to compare each column in the dataframe on a graph using Plotly.

In [12]:
def df_scatter(df, title, seperate_y_axis=False, y_axis_label='', scale='linear', initial_hide=False):
    '''Generate a scatter plot of the entire dataframe'''
    label_arr = list(df)
    series_arr = list(map(lambda col: df[col], label_arr))

    layout = go.Layout(
        title=title,
        legend=dict(orientation="h"),
        xaxis=dict(type='date'),
        yaxis=dict(
            title=y_axis_label,
            showticklabels=not seperate_y_axis,
            type=scale
        )
    )

    y_axis_config = dict(
        overlaying='y',
        showticklabels=False,
        type=scale
    )

    visibility = 'visible'
    if initial_hide:
        visibility = 'legendonly'

    # Form Trace For Each Series
    trace_arr = []
    for index, series in enumerate(series_arr):
        trace = go.Scatter(
            x=series.index,
            y=series,
            name=label_arr[index],
            visible=visibility
        )

        # Add separate axis for the series
        if seperate_y_axis:
            trace['yaxis'] = 'y{}'.format(index + 1)
            layout['yaxis{}'.format(index + 1)] = y_axis_config
        trace_arr.append(trace)

    fig = go.Figure(data=trace_arr, layout=layout)
    py.iplot(fig)

In the interest of brevity, I won't go too far into how this helper function works. Check out the documentation for Pandas and Plotly if you would like to learn more.

With the function defined, we can compare our pricing datasets like so.

In [13]:
# Plot all of the BTC exchange prices
df_scatter(btc_usd_datasets, 'Bitcoin Price (USD) By Exchange')

Step 2.6 - Clean and Aggregate the Pricing Data

We can see that, although the four series follow roughly the same path, there are various irregularities in each that we'll want to get rid of.

Let's remove all of the zero values from the dataframe, since we know that the price of Bitcoin has never been equal to zero in the timeframe that we are examining.

In [14]:
# Remove "0" values
btc_usd_datasets.replace(0, np.nan, inplace=True)

When we re-chart the dataframe, we'll see a much cleaner looking chart without the spikes.

In [15]:
# Plot the revised dataframe
df_scatter(btc_usd_datasets, 'Bitcoin Price (USD) By Exchange')

We can now calculate a new column, containing the daily average Bitcoin price across all of the exchanges.

In [16]:
# Calculate the average BTC price as a new column
btc_usd_datasets['avg_btc_price_usd'] = btc_usd_datasets.mean(axis=1)
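This is also where the zero-to-NaN replacement from the previous step pays off: DataFrame.mean skips NaN values by default, so a bad reading on one exchange doesn't drag the average down. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Two toy exchange price columns; the 0.0 is a bogus reading
df = pd.DataFrame({
    'A': [100.0, 0.0, 102.0],
    'B': [101.0, 103.0, 101.0],
})

# A naive row average treats the bogus 0.0 as a real price
naive_avg = df.mean(axis=1)  # second row: (0 + 103) / 2 = 51.5

# Replacing 0 with NaN lets mean() skip the missing value instead
clean_avg = df.replace(0, np.nan).mean(axis=1)  # second row: 103.0
```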

This new column is our Bitcoin pricing index! Let's chart that column to make sure it looks ok.

In [17]:
# Plot the average BTC price
btc_trace = go.Scatter(x=btc_usd_datasets.index, y=btc_usd_datasets['avg_btc_price_usd'])

Yup, looks good. We'll use this aggregate pricing series later on, in order to convert the exchange rates of other cryptocurrencies to USD.

Step 3 - Retrieve Altcoin Pricing Data

Now that we have a solid time series dataset for the price of Bitcoin, let's pull in some data on non-Bitcoin cryptocurrencies, commonly referred to as altcoins.

Step 3.1 - Define Poloniex API Helper Functions

For retrieving data on cryptocurrencies we'll be using the Poloniex API. To assist in the altcoin data retrieval, we'll define two helper functions to download and cache JSON data from this API.

First, we'll define get_json_data, which will download and cache JSON data from a provided URL.

In [18]:
def get_json_data(json_url, cache_path):
    '''Download and cache JSON data, return as a dataframe.'''
    try:
        f = open(cache_path, 'rb')
        df = pickle.load(f)
        print('Loaded {} from cache'.format(json_url))
    except (OSError, IOError) as e:
        print('Downloading {}'.format(json_url))
        df = pd.read_json(json_url)
        df.to_pickle(cache_path)
        print('Cached {} at {}'.format(json_url, cache_path))
    return df

Next, we'll define a function to format Poloniex API HTTP requests and call our new get_json_data function to save the resulting data.

In [19]:
base_polo_url = 'https://poloniex.com/public?command=returnChartData&currencyPair={}&start={}&end={}&period={}'
start_date = datetime.strptime('2015-01-01', '%Y-%m-%d') # get data from the start of 2015
end_date = datetime.now() # up until today
period = 86400 # pull daily data (86,400 seconds per day)

def get_crypto_data(poloniex_pair):
    '''Retrieve cryptocurrency data from poloniex'''
    json_url = base_polo_url.format(poloniex_pair, start_date.timestamp(), end_date.timestamp(), period)
    data_df = get_json_data(json_url, poloniex_pair)
    data_df = data_df.set_index('date')
    return data_df

This function will take a cryptocurrency pair string (such as 'BTC_ETH') and return the dataframe containing the historical exchange rate of the two currencies.

Step 3.2 - Download Trading Data From Poloniex

Most altcoins cannot be bought directly with USD; to acquire these coins individuals often buy Bitcoins and then trade the Bitcoins for altcoins on cryptocurrency exchanges. For this reason we'll be downloading the exchange rate to BTC for each coin, and then we'll use our existing BTC pricing data to convert this value to USD.
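The conversion itself is just an index-aligned multiplication. Here's a sketch with made-up numbers showing an ETH/BTC rate being converted to USD via a BTC/USD index:

```python
import pandas as pd

dates = pd.to_datetime(['2017-08-20', '2017-08-21'])

# Hypothetical ETH price in BTC, and the BTC price index in USD
eth_btc = pd.Series([0.0716, 0.0805], index=dates)
btc_usd = pd.Series([4093.0, 4022.0], index=dates)

# Multiplication aligns on the shared date index, yielding ETH in USD
eth_usd = eth_btc * btc_usd
```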

We'll download exchange data for nine of the top cryptocurrencies - Ethereum, Litecoin, Ripple, Ethereum Classic, Stellar, Dash, Siacoin, Monero, and NEM.

In [20]:
altcoins = ['ETH','LTC','XRP','ETC','STR','DASH','SC','XMR','XEM']

altcoin_data = {}
for altcoin in altcoins:
    coinpair = 'BTC_{}'.format(altcoin)
    crypto_price_df = get_crypto_data(coinpair)
    altcoin_data[altcoin] = crypto_price_df
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_ETH&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_LTC&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_XRP&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_ETC&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_STR&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_DASH&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_SC&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_XMR&start=1420045200.0&end=1503369183.168025&period=86400 from cache
Loaded https://poloniex.com/public?command=returnChartData&currencyPair=BTC_XEM&start=1420045200.0&end=1503369183.168025&period=86400 from cache

Now we have a dictionary of 9 dataframes, each containing the historical daily average exchange prices between the altcoin and Bitcoin.

We can preview the last few rows of the Ethereum price table to make sure it looks ok.

In [21]:
altcoin_data['ETH'].tail()
close high low open quoteVolume volume weightedAverage
2017-08-18 0.071321 0.072906 0.069231 0.070200 153816.590806 10908.476511 0.070919
2017-08-19 0.070587 0.072988 0.070000 0.071321 179797.304636 12841.666823 0.071423
2017-08-20 0.073525 0.073710 0.070400 0.070690 100756.634696 7213.589872 0.071594
2017-08-21 0.080500 0.087044 0.071717 0.073500 491598.852480 39587.121362 0.080527
2017-08-22 0.081237 0.086280 0.079100 0.080500 88711.830886 7344.130450 0.082786

Step 3.3 - Convert Prices to USD

Since we now have the exchange rate for each cryptocurrency to Bitcoin, and we have the Bitcoin/USD historical pricing index, we can directly calculate the USD price series for each altcoin.

In [22]:
# Calculate USD Price as a new column in each altcoin dataframe
for altcoin in altcoin_data.keys():
    altcoin_data[altcoin]['price_usd'] =  altcoin_data[altcoin]['weightedAverage'] * btc_usd_datasets['avg_btc_price_usd']

Here, we've created a new column in each altcoin dataframe with the USD prices for that coin.

Next, we can re-use our merge_dfs_on_column function from earlier to create a combined dataframe of the USD price for each cryptocurrency.

In [23]:
# Merge USD price of each altcoin into single dataframe 
combined_df = merge_dfs_on_column(list(altcoin_data.values()), list(altcoin_data.keys()), 'price_usd')

Easy. Now let's also add the Bitcoin prices as a final column to the combined dataframe.

In [24]:
# Add BTC price to the dataframe
combined_df['BTC'] = btc_usd_datasets['avg_btc_price_usd']

Now we should have a single dataframe containing daily USD prices for the ten cryptocurrencies that we're examining.

Let's reuse our df_scatter function from earlier to chart all of the cryptocurrency prices against each other.

In [25]:
# Chart all of the altcoin prices
df_scatter(combined_df, 'Cryptocurrency Prices (USD)', seperate_y_axis=False, y_axis_label='Coin Value (USD)', scale='log')