Doruk's Sporadic Musings

a collection of posts about Data Science and other stuff

GPU Accelerated Matrix Factorization for Recommender Systems

21 December 2018

Matrix Factorization (MF) is a popular algorithm used to power many recommender systems. Efficient and scalable MF algorithms are essential for training on the massive datasets that large-scale recommender systems utilize. This blog post presents cu2rec, a matrix factorization algorithm written in CUDA. With a single NVIDIA GPU, cu2rec can be 10x faster than state-of-the-art sequential algorithms while reaching similar error metrics.

read more

Running Jupyter Notebook on NYU HPC in 3 Clicks

18 November 2018

I’ve lately been getting a lot of questions about my setup for NYU’s High Performance Computing cluster and how it can be made more convenient to use. In this post, I’m going to share my tips and tricks for boosting productivity when working with remote clusters that use the Slurm Workload Manager, such as NYU’s Prince cluster.

read more

Hierarchical Clustering and its Applications

26 October 2018

Clustering is one of the most well-known techniques in Data Science. From customer segmentation to outlier detection, it has a broad range of uses, with different techniques fitting different use cases. In this blog post we will take a look at hierarchical clustering, which builds a nested hierarchy of clusters rather than a single flat partition.

read more

Recommender Systems: From Filter Bubble to Serendipity

09 October 2018

Recommender systems power a lot of our day-to-day interactions with the content we see on the internet. With over 2.5 quintillion bytes of data created each day, the last two years alone account for 90% of the data in the world [1]. We produce content at a rate that is simply impossible to consume in one lifetime, and that makes recommender systems inevitable. However, as Uncle Ben said, with great power comes great responsibility. Here I talk about some of the practical and ethical problems that recommender systems raise, and how we can go about solving them.

read more

Representation Learning through Matrix Factorization

10 September 2018

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) have been around for a while, and have been successfully utilized for learning intermediary representations of data for quite some time. This post is a recap of what they actually do, and how they work.

read more

Why you should use PCA before Decision Trees

11 August 2018

Dimensionality Reduction techniques have been consistently useful in Data Science and Machine Learning. They can reduce training times, let you remove features that hold no predictive value, and even help with noise reduction. In this blog post, we are going to focus on why they might even make your classifier perform better.

read more

Introducing Books2Rec

14 May 2018

I am proud to announce that Books2Rec, the book recommendation system I have been working on for the last couple of months, is live. Using your Goodreads profile, Books2Rec uses Machine Learning methods to provide you with highly personalized book recommendations. Don’t have a Goodreads profile? We’ve got you covered - just search for your favorite book.

read more

Data Science workflow using Ubuntu Subsystem on Windows

06 May 2018

Microsoft’s latest push to bring developers to Windows comes in the form of embracing Linux as part of their system. Windows Subsystem for Linux, also known as WSL, has been around for over a year now. After getting fed up with using Linux VMs as a development environment, and later getting fed up with having to switch operating systems in a dual-boot config, I was ready to try it. I’m very glad that I did.

read more

Adventures with RapidMiner

01 March 2018

For our Big Data Science course @ NYU, Nick, Amit, and I are building a Book Recommender System, specifically for use with Goodreads. Although the exact details of what we’re doing are yet to be ironed out, we thought it was a good idea to get a quick and dirty baseline with RapidMiner. More than a couple of hours later, I can safely say that while we did get a baseline, it ended up being more work than intended.

read more