Thank you to the fantastic group of participants at the Human-Centered Artificial Intelligence (HCAI) Summer School Course!
Recently, I had to run heavy experiments that my MacBook Pro just wasn’t up to spec for. Seeing how Google is offering a nice credit bonus for signing up to their Cloud platform and a sustained-use discount, I decided to give their cloud services a go. Initially, I tried out their DataLab app, but quickly found myself wanting more fine-grained control.
It was surprisingly easy to set up a Google Compute Engine Virtual Machine (VM) with the required hardware and software resources. In my case, I needed a moderately-sized server (16-24 cores) with IPython/Jupyter, Tensorflow and Julia. This guide shows you how to set up a similar VM step-by-step, in 30 minutes or less.
After you sign up to Google Cloud Platform, there are three basic steps to complete:
Prerequisites: I’m going to assume you know your way around a Linux terminal.
Follow the Quickstart guide to create a new VM instance, but note the following:
Google’s Compute Engine has a sweet browser-based SSH terminal you can use to connect to your machine. We’ll be using it for additional setup below.
The VM that we instantiated comes with a 10GB SSD drive. It’s fast, but I needed more space. Follow these instructions to add more disk space.
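In case the linked instructions move, the gist of adding a persistent disk looks something like the sketch below. The disk name `my-data-disk`, the 200GB size, and the `/dev/sdb` device path are all my placeholders, not values from the original post; check `lsblk` on the VM for the actual device before formatting anything.

```shell
# On your local machine: create a persistent disk and attach it to the VM
# (disk name, size, zone, and instance name are placeholders)
gcloud compute disks create my-data-disk --size=200GB --zone=<host-zone>
gcloud compute instances attach-disk <host-name> --disk=my-data-disk --zone=<host-zone>

# On the VM: format the new disk and mount it.
# WARNING: mkfs erases the device -- confirm the path with `lsblk` first.
sudo mkfs.ext4 -F /dev/sdb
sudo mkdir -p /mnt/data
sudo mount /dev/sdb /mnt/data
```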
We’ll install three major software packages: Anaconda Python, Google Tensorflow and Julia. For many data scientists, Anaconda Python should suffice, but I wanted to play with Deep Learning models, and needed to run a tensor-factorizer I had written in Julia.
Bring up your SSH terminal. Let’s create a downloads directory to keep things organized:
mkdir downloads
cd downloads
There are several Python distributions around, but Anaconda is my favorite. It bundles popular scientific computing libraries into a single, coherent, easy-to-install package.
Note: The following installs Python 3, if you want Python 2.x, replace Anaconda3-X.X.X… with Anaconda2-X.X.X…
In your SSH terminal, enter:
wget http://repo.continuum.io/archive/Anaconda3-4.0.0-Linux-x86_64.sh
bash Anaconda3-4.0.0-Linux-x86_64.sh
and follow the on-screen instructions. The defaults usually work fine, but answer yes to the last question about prepending the install location to PATH:
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/haroldsoh/.bashrc ? [yes|no]
[no] >>> yes
To make use of Anaconda right away, source your bashrc:
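Assuming the default bash shell, that’s:

```shell
# Re-read ~/.bashrc so the Anaconda PATH entry takes effect
# in the current session
source ~/.bashrc

# Sanity check: python should now resolve to the Anaconda install
which python
```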
Tensorflow is Google’s open-source deep learning / machine intelligence library. I’ve been using it for about a month and a few issues aside—some components (e.g., dynamic RNN cells) are still under active development—it’s a pleasure being able to develop state-of-the-art deep models in relatively few lines of code. We’re going to install the conda package contributed by Jonathan Helmus:
conda install -c jjhelmus tensorflow=0.8.0rc0
If you prefer, you can also install via pip: follow these instructions.
Julia is a fantastic language that I’ve written about before. With a MATLAB-like syntax, Julia is easy to pick up and work in, and the real kicker is that Julia code is fast (much faster than plain Python and MATLAB). Think C/C++ speed in far fewer lines of code. To install:
sudo add-apt-repository ppa:staticfloat/juliareleases
sudo add-apt-repository ppa:staticfloat/julia-deps
sudo apt-get update
sudo apt-get install julia
If you want to use Julia via the notebook interface, install the IJulia package.
julia -e 'Pkg.add("IJulia")'
In our final step, we’ll need to set up the Jupyter server and connect to it. The following instructions come mainly from here, with some tweaks.
Open up an SSH session to your VM. Check whether you already have a Jupyter configuration file:
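By default, Jupyter keeps its configuration at `~/.jupyter/jupyter_notebook_config.py`, so a quick check looks like:

```shell
# Print the config path if the file exists, or a short note if it doesn't
ls ~/.jupyter/jupyter_notebook_config.py 2>/dev/null || echo "no config yet"
```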
If it doesn’t exist, create one:
jupyter notebook --generate-config
We’re going to add a few lines to your Jupyter configuration file; the file is plain text, so you can do this via your favorite editor (e.g., vim, emacs):
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8123
Once that’s done, you have two options for starting up the server. The first is via nohup. The second is using screen. Both ensure your server doesn’t die upon logout. Using nohup is slightly easier, but I prefer the latter; I always install screen in a remote system I’m using. Choose either Option A or B below, then move on to setting up your client.
This is the easier option. Create a notebooks directory and start our Jupyter server there.
mkdir notebooks
cd notebooks
nohup jupyter notebook > ~/notebook.log &
This is the more complicated option, but you’ll learn how to use screen, which I’ve found to be tremendously useful. Install screen:
sudo apt-get install screen
and start a screen session with the name jupyter:
screen -S jupyter
The -S option names our session (else, screen will assign a numeric ID). I’ve chosen “jupyter” but the name can be anything you want.
Create a notebooks directory and start the jupyter notebook server:
cd ~/
mkdir notebooks
cd notebooks
jupyter notebook
Press CTRL-A, then D to detach from the screen session and return to the main command line. If you want to re-attach to this screen session in the future, type:
screen -r jupyter
You can now close your SSH session if you like and Jupyter will keep running.
Now that we have the server side up and running, we need to set up an SSH tunnel so that you can securely access your notebooks.
For this, you’ll need to install the Google Cloud SDK on your local machine. Come back after it’s installed.
Now, authenticate yourself:
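The post doesn’t show the command itself; with the Cloud SDK installed, authentication is typically done with `gcloud auth login` (the `<project-id>` below is a placeholder for your own project):

```shell
# Authenticate the Cloud SDK with your Google account
# (opens a browser window for the OAuth flow)
gcloud auth login

# Point the SDK at the project that owns your VM
gcloud config set project <project-id>
```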
and initiate an SSH tunnel from your machine to the server:
gcloud compute ssh --zone=<host-zone> \
    --ssh-flag="-D" --ssh-flag="1080" \
    --ssh-flag="-N" --ssh-flag="-n" <host-name>
You’ll need to replace the <host-zone> and <host-name> with the appropriate zone and host-name of your VM (that you took note of in the first step). You’ll also find the relevant info on your VM Instances page.
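If you don’t have the zone and name handy, the Cloud SDK can also list them for you (assuming `gcloud` is already installed and authenticated):

```shell
# Lists every VM in the project along with its name and zone,
# which is exactly what the tunnel command needs
gcloud compute instances list
```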
Finally, start up your favorite browser with the right configuration:
<browser executable path> \
    --proxy-server="socks5://localhost:1080" \
    --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
    --user-data-dir=/tmp/
Replace <browser executable path> with your full browser path on your system. See here for some common executable paths on different operating systems. Optional: Write a simple bash script called gcenotebook.sh to avoid having to type that whole long string each time you want to launch the browser. My script looks like this:
#!/bin/bash
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --proxy-server="socks5://localhost:1080" \
    --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
    --user-data-dir=/tmp/
which I made executable the usual way:
chmod +x gcenotebook.sh
Finally, using the browser that you just launched, head to your server’s main notebook page at http://<host-name>:8123 (the port we set in the Jupyter configuration). If everything went according to plan, you should see the Jupyter notebook home page. With both Python and Julia installed, you can start a notebook using either as a kernel and get to work.
You now have a 16-core machine learning/data science compute server that you didn’t have 30 minutes ago. Have fun!
Headed to NYC in July! Will be presenting my paper on a probabilistic variant of multi-dimensional scaling (MDS) at the 2016 International Joint Conference on Artificial Intelligence (IJCAI)! The acceptance rate was below 25%, so it’s certainly satisfying that the paper was accepted.
About the work: we take a fresh Bayesian view of MDS—an old dimensionality reduction method—and find connections to popular machine learning methods such as probabilistic matrix factorization (used in recommender systems) and word embedding (for natural language processing).
The probabilistic viewpoint allows us to connect distance/similarity matching to non-parametric learning methods such as sparse Gaussian processes (GPs), and we derive a novel method called the Variational Bayesian MDS Gaussian Process (VBMDS-GP) [yes, a mouthful!]. As concrete examples, we apply it to multi-sensor localization and perhaps more interestingly, political unfolding.
In the unfolding task, we projected political candidates onto a 2-d plane using their associated Wikipedia articles and a ~15,000-response voter preference survey done in 2004 for other candidates. The projection is not perfect since we use very simple Bag-of-Words (BoW) features—I think Sanders is more liberal than the map implies—but it is nevertheless coherent. We see our favorite political candidate, Donald Drumpf, projected to the conservative section and President Obama projected near the Clintons.
The model can be extended in lots of different ways; I’m working on using more recent variational inference techniques, plus maybe some “deep” extensions.
Workable published a short data-science (probabilistic) puzzle at http://buff.ly/1Rip3b0:
Suppose we have 4 coins, each with a different probability of throwing Heads. An unseen hand chooses a coin at random and flips it 50 times. This experiment is done several times, with the resulting sequences as shown below (H = Heads, T = Tails). Write a program that will take as input the data that is collected from this experiment and estimate the probability of heads for each coin.
Well, I thought I would spend a few minutes on it and well, a few minutes turned into more than 30. My solution is simply maximum likelihood estimation (MLE), i.e., minimizing the negative log likelihood of the data given the model parameters:

$$\hat{\theta} = \arg\min_{\theta} \left[ -\sum_{i} \log \sum_{k=1}^{4} p(z_i = k)\, \theta_k^{h_i} (1 - \theta_k)^{50 - h_i} \right]$$

where $h_i$ is the number of observed heads in the $i$-th sequence of 50, $z_i$ is the coin selected ($z_i$ forms a categorical distribution over the 4 coins for each sequence), and $\theta_k$ is the probability of heads for coin $k$.
Maximum Likelihood Estimates: [0.428, 0.307, 0.762, 0.817]

The first solution didn’t use any speed-up tricks or derivatives, so it should be easy to follow, but it is not terribly accurate or efficient. I then tried out automatic differentiation, which required minor code changes and sped up the computation significantly. This updated, faster solution (coinsoln.jl) found a slightly different result using conjugate gradients (negLL: -7.31325):

Maximum Likelihood Estimates: [0.283, 0.283, 0.813, 0.458]
Oh, and if I made any stupid errors, please let me know.
Over the past few months, I’ve been exploring Julia, a new open-source programming language that’s been gaining traction in the technical computing community. For the uninitiated, Julia aims to solve the “two-language problem”—having to code in a high-level language (e.g., Python/Matlab) for prototyping and a low-level language (e.g., C/C++) for speed. I found Julia easy to pick up—it’s syntactically similar to MATLAB with some important differences—and I’ve enjoyed writing a few small tools.
Admittedly, I did encounter a few problems along the way. For example, initial installation and setup was a breeze, but I had trouble getting IJulia and Gadfly to play nice. Jiahao kindly helped me resolve this problem; as a tip, installing Anaconda’s Python distribution helps avoid many issues. Although native Julia packages are being built rapidly, I’ve found quality to be mixed. When I needed a mature library, it was easy to make calls to C and Python, but that’s an additional step and dependency. There are a few additional quirks with regard to package imports and function replacement (something I haven’t quite gotten down yet) and garbage collection, which can take significant CPU time if you aren’t careful. One more thing to note is that Julia uses a just-in-time (JIT) compiler, so first-time runs are typically slower.
All in all, I believe Julia is at the point where, personally, the pros outweigh the cons, and I’m starting to port over some of my older code. If you haven’t yet tried Julia, I highly recommend giving it a go. But like learning anything new, expect to be a little confused at times and to consult the Julia documentation regularly. Oh, and to lean on the helpful and growing Julia community.
I’ve been playing around with D3.js the past few weeks and just completed a first-cut visualization of how Singapore’s ethnic demographics changed from 2000 to 2010. Feedback is welcome, particularly on what works and what doesn’t (visually). Some context: the idea for this explorative tool came up after a few conversations with locals about the extent of ethnic integration. Several voiced opinions left me with questions about the extent of racial discrimination in Singaporean society. Although the visualization doesn’t address this question, it was a step towards a better understanding of Singapore’s ethnic demography. Feel free to explore the visualization, and you can learn more at Singstats.