GPU Accelerated Matrix Factorization for Recommender Systems
21 December 2018
Matrix Factorization (MF) is a popular algorithm used to power many recommender systems. Efficient and scalable MF algorithms are essential in order to train on the massive datasets that large-scale recommender systems utilize. This blog post presents cu2rec, a matrix factorization algorithm written in CUDA. With a single NVIDIA GPU, cu2rec can be 10x faster than state-of-the-art sequential algorithms while reaching similar error metrics.
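For intuition, here is a minimal CPU sketch (in NumPy, with made-up function and variable names) of the SGD update rule that MF-based recommenders optimize. It is not cu2rec's actual code; cu2rec's contribution is running many of these updates in parallel on the GPU rather than in a Python loop.

```python
import numpy as np

def sgd_mf_epoch(P, Q, ratings, lr=0.01, reg=0.02):
    """One SGD pass over (user, item, rating) triples for matrix factorization.

    P: (n_users, k) user factors; Q: (n_items, k) item factors.
    Illustrative sequential sketch of the update rule only.
    """
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]               # prediction error for this rating
        p_u = P[u].copy()                   # keep the old user factors for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
```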
Running Jupyter Notebook on NYU HPC in 3 Clicks
18 November 2018
Lately, I’ve been getting a lot of questions about my setup for NYU’s High Performance Computing cluster and how it can be made more convenient to use. In this post, I’m going to be sharing my tips and tricks for increasing productivity when working with remote clusters that use the Slurm Workload Manager, such as NYU’s Prince cluster.
Hierarchical Clustering and its Applications
26 October 2018
Clustering is one of the best-known techniques in Data Science. From customer segmentation to outlier detection, it has a broad range of uses, and different techniques that fit different use cases. In this blog post we will take a look at hierarchical clustering, which is the hierarchical application of clustering techniques.
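As a quick illustration, here is a hedged SciPy sketch of agglomerative (bottom-up) hierarchical clustering; the toy data and parameters are made up and not taken from the post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 10 random points in 2D (illustrative only).
X = np.random.rand(10, 2)

# Build the dendrogram bottom-up with Ward linkage,
# then cut it to obtain 3 flat clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```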
Recommender Systems: From Filter Bubble to Serendipity
09 October 2018
Recommender systems power a lot of our day-to-day interactions with the content we see on the internet. Over 2.5 quintillion bytes of data are created each day, and the last two years alone account for 90% of all the data in the world [1]. We produce content at a rate that is simply impossible to consume in one lifetime, and that makes recommender systems inevitable. However, as Uncle Ben said, with great power comes great responsibility. Here I talk about some of the practical and ethical problems that recommender systems raise, and how we can go about solving them.
Representation Learning through Matrix Factorization
10 September 2018
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) have been around for a while, and have been successfully used to learn intermediary representations of data for quite some time. This post is a recap of what they actually do and how they work.
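As a refresher, here is a small NumPy sketch of how PCA can be computed through SVD to obtain a lower-dimensional representation; the function name and the number of components are illustrative, not taken from the post.

```python
import numpy as np

def pca_via_svd(X, n_components=2):
    """Project X onto its top principal components using SVD.

    Center the data, decompose it, and keep the directions
    with the largest singular values.
    """
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # low-dimensional representation
```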
A case for better Python docstrings
18 August 2018
I have been using Python daily for 3 years, and working professionally with it for about a year now. It is a great language, and I’m grateful that I get to work with it every day. However, my biggest gripe with Python has been the rather uninspiring docstrings found online and in style guides. Here, I propose a clean, explicit, and visually appealing alternative.
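For context, this is the kind of conventional docstring (here, roughly Google-style, on a made-up function) commonly found in style guides; the cleaner format the post proposes is not reproduced here.

```python
def fetch_rows(table, limit=10):
    """Fetch rows from a table.

    Args:
        table (str): Name of the table to query.
        limit (int): Maximum number of rows to return.

    Returns:
        list: The fetched rows.
    """
    ...
```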
Why you should use PCA before Decision Trees
11 August 2018
Dimensionality Reduction techniques have been consistently useful in Data Science and Machine Learning. They can reduce training times, let you remove features that hold no predictive value, and even help with noise reduction. In this blog post, we are going to focus on why PCA might even make your classifier perform better.
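As a hedged illustration of the idea (the dataset and component count are not from the post, and whether PCA helps depends on the data), one can compare a plain decision tree against a PCA-plus-tree pipeline in scikit-learn:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Baseline tree vs. a tree trained on the top principal components.
tree = DecisionTreeClassifier(random_state=0)
pca_tree = make_pipeline(PCA(n_components=20), DecisionTreeClassifier(random_state=0))

print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(pca_tree, X, y, cv=5).mean())
```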
Introducing persistd
05 August 2018
persistd is a workspace/workflow manager made for multi-tasking developers. It allows you to persist your virtual desktop over multiple reboots. Automatically open all your relevant programs, and close them when you’re done for the day. Never fear Windows updates again.
Introducing Books2Rec
14 May 2018
I am proud to announce that Books2Rec, the book recommendation system I have been working on for the last couple of months, is live. Using your Goodreads profile, Books2Rec uses Machine Learning methods to provide you with highly personalized book recommendations. Don’t have a Goodreads profile? We’ve got you covered - just search for your favorite book.
Data Science workflow using Ubuntu Subsystem on Windows
06 May 2018
Microsoft’s latest push for bringing developers to Windows comes in the form of embracing Linux as part of their system. Windows Subsystem for Linux, also known as WSL, has been around for over a year now. After getting fed up with using Linux VMs as a development environment, and later getting fed up with having to switch operating systems in a dual-boot config, I was ready to try it. I’m very glad that I did.
Adventures with RapidMiner
01 March 2018
For our Big Data Science course @ NYU, Nick, Amit, and I are building a Book Recommender System, specifically for use with Goodreads. Although the exact details of what we’re doing are yet to be ironed out, we thought it was a good idea to get a quick and dirty baseline with RapidMiner. More than a couple of hours later, I can safely say that while we did get a baseline, it ended up being more work than intended.
Representation Learning: An Introduction
24 February 2018
Representation Learning is a relatively new term that encompasses many different methods of extracting some form of useful representation of the data, based on the data itself. Does that sound too abstract? That’s because it is, and it is purposefully so. Feature learning, dimensionality reduction, data reduction, and even matrix factorization are all part of Representation Learning.
Hello World
17 February 2018
Welcome to the blog. This blog will house my adventures through grad school and beyond. The contents of this blog will therefore (presumably) be about Data Science and Machine Learning, though it is very much possible that I might add in some other random stuff.