I science data and hack machine learning code in equal measure, and I absolutely enjoy doing both. I code in R and Python, am starting to code in Golang, and want to code in Scala and Erlang. I prefer functional programming over OOP.
I get a twisted sort of pleasure from cleaning untidy datasets and enjoy imputing missing values more than I care to admit :)
I'm also a grad student studying the Philosophy of Cognitive Science and Epistemology. One of my long-term goals is to merge the advances in machine learning with contemporary debates in Phil. Cog. Sci., and so explore the thick of what it means to be artificially intelligent.
At a recent DataKind SF event, I was rather intrigued by the challenges faced in investigating wage theft and other labor violations, not just nationwide but also specifically in California and the Bay Area. At first, and probably like a lot of you, I had little idea what “wage theft” and other labor violations really entailed. After some reading, and in correspondence with some very motivated domain experts from the Stanford Center for Integrated Facility Engineering and the San Francisco Dept. of Labor Wage and Hour Division, the extent and impact of this problem began to sink in. This NY Times article captures the gist of the issue well. Withholding overtime pay, paying below minimum wage, skimming wages off paychecks, violations against under-age workers (child labor!), visa abuse, etc. are all forms of theft and abuse that have a big impact on at-risk workers and their families, and make life harder than it already is.
The Wage and Hour Division (WHD) has done a nice job of summarizing some impact statistics around this problem. Meanwhile, the Department of Labor has put together a dataset of all known (investigated and closed) cases of violations nationwide at this D.O.L. Enforcement portal. The “Wage and Hour Compliance Action Data” is what I’ll be looking at, specifically.
I love cult films! What’s not to love about Sam Jackson on a plane with a bunch of steroid-induced snakes unleashed on passengers? Yeah, the movie’s as weird as it sounds, and yet it leaves me with a yearning to discuss it and to quote it years after its release, gives me a weird fascination with Sam Jackson, and makes me curious about [the late] director David Ellis. I feel connected to any discussion around it and enjoy quips that reference scenes and dialogue in the movie. Others like me might even wear paraphernalia from the movie, years later. It’s an attachment (an agglomerate set of feelings and actions) that’s distinct from an appreciation of a really good indie movie.
What, then, is a cult movie? Consider some literature from AMC’s filmsite:
Cult films have limited but very special appeal. Cult films are usually strange, quirky, offbeat, eccentric, oddball, or surreal, with outrageous, weird, unique and cartoony characters or plots, and garish sets. They are often considered controversial because they step outside standard narrative and technical conventions. They can be very stylized, and they are often flawed or unusual in some striking way.
Wow, those are a lot of subjective measures. I mean, it’s not like I can measure “quirky” in milligrams, now can I? Well played, interweb… it seems as if I must outwit you to try and study this phenomenon better, which brings me to the point:
The problem statement! I’ll break it into two parts…
Figure out a good measure of what makes a movie a cult phenomenon. (this post)
Knowing this, can I use a movie’s pre-release information to predict whether or not it will become a cult phenom? (near future)
Last summer (2015), as I put myself through the paces in this brilliant course by one of my personal heroes, Andrew Ng, I grew exceedingly confident about my ability to implement complex machine learning approaches (I blame, or rather credit, Dr. Ng). Consequently, upon finishing the course, I jumped straight into [what I later realized was] the deep end by signing up for the Metis¹ Naive Bees Classifier challenge, hosted by DrivenData.org².
Although my main intention was just to get my hands dirty with machine learning code, I quickly realized that my approach to training an algorithm to tell the two bee genera apart was rather, well… naive. I was trying to extract the dominant colors from the training images using either Principal Component Analysis or K-Means clustering; once done, I wanted to run a classifier on this much smaller feature space. This turned out to be an ill-informed strategy (I’m too embarrassed to post the training error), simply because… well, take a look at some of the training images for yourself:
Way too many color variations in background!
Can you spot the Bombus bee? Took me a while!
It gets harder to tell the Apis apart from the background (the background colors are pretty varied)
A rare close-up of an Apis (honey bee)
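The dominant-color idea described above can be sketched roughly as follows. This is only an illustrative sketch, not the code from the challenge entry: it assumes a plain K-Means over RGB pixels written with the standard library, and the names `dominant_colors`, `color_features` and `toy_image` are made up for the example. A real pipeline would load the pixels with an image library and probably use a tested K-Means implementation instead.

```python
import random


def dominant_colors(pixels, k=3, iterations=10, seed=0):
    """Cluster RGB pixels with a basic K-Means and return the k cluster
    centers, largest cluster first. Needs at least k distinct colors."""
    rng = random.Random(seed)
    # Start from k distinct colors so no two centers coincide at the outset.
    centers = rng.sample(sorted(set(pixels)), k)
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(ch) / len(members) for ch in zip(*members))
    order = sorted(range(k), key=lambda i: len(clusters[i]), reverse=True)
    return [centers[i] for i in order]


def color_features(pixels, k=3):
    """Flatten the k dominant colors into one feature vector of length 3 * k."""
    return [channel for color in dominant_colors(pixels, k) for channel in color]


# Hypothetical toy "image": 60 yellowish pixels and 40 dark ones, as RGB tuples.
toy_image = [(200, 180, 40)] * 60 + [(30, 30, 30)] * 40
features = color_features(toy_image, k=2)
```

A feature vector like this is what would then be handed to a classifier. It also hints at why the strategy struggles on these photos: when most pixels belong to a busy background, the dominant colors mostly describe the background and say little about the bee.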
[Click “Read More” to read how I explored K-Means clustering on these images.]