Clustering Big Data
Posted by J Singh | Filed under Clustering, Locality Sensitive Hashing, OpenLSH, Big Data, K-Means
Join us for a discussion on Clustering Big Data
When: Thursday June 4, noon – 1:00 pm EST.
Where: (Virtual Meeting) Contact Us for coordinates.
Description: Approximate Nearest Neighbor methods for clustering and indexing have been actively researched ever since the K-Means algorithm was published in 1975 (and coded in FORTRAN). A recent book lists about 300 variants and related topics.
The 50th Anniversary issue of Communications of the ACM in 2008 cited two pieces of "Breakthrough Research". One was MapReduce, the other was clustering based on Locality Sensitive Hashing (LSH). Locality Sensitive Hashing is designed for large data sets and alleviates many of the issues seen with K-Means. Want to see whether a body of code is remarkably similar to a public GitHub repo? Want to find "similar" fragments of DNA that are common to several species? LSH will get you there faster than most other techniques.
The talk will demonstrate OpenLSH, an open source implementation of LSH we have been working on.
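For readers new to the idea, here is a minimal sketch of one common LSH scheme: MinHash signatures with banding, used to find near-duplicate text. It is an illustration only and is not OpenLSH's code or API. The function names, the 5-character shingle size, and the 64-hash / 16-band settings are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-character shingles of a whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item; the seed simulates a family of hash functions."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the minimum hash value."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Banding step: documents that agree on every row of at least one band
    land in the same bucket and become candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical sample documents, made up for this illustration.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "1234567890 9876543210 5555 4444",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
pairs = candidate_pairs(signatures)
print(sorted(pairs))
print(estimated_jaccard(signatures["a"], signatures["b"]))
```

The point of the banding step is that only documents sharing a bucket are ever compared, so the expensive all-pairs comparison is avoided. More bands with fewer rows each catches less similar pairs at the cost of more false candidates.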
Speaker Bios: Dr. J Singh is a Principal at DataThinks.org, a Cloud and Big Data consulting company. He is a frequent speaker on NoSQL, Hadoop MapReduce and social media analytics. He is the originator of OpenLSH, an open source project for Locality Sensitive Hashing, the technique discussed in this talk. Dr. Singh is an Adjunct Professor of Computer Science at WPI, where he covers Database Technology, Big Data and Data Mining. Dr. Singh has been the organizer of the Boston Cloud Services meetup group. He received his MS in Pattern Recognition and PhD in Numerical Algorithms from Syracuse University.
Teresa Nicole Brooks is a software engineer with a passion for all things data. She has both professional and academic experience in natural language processing, information retrieval, big data processing and search. Her interest in web-scale data mining and search drives her interest in near-duplicate document detection and locality sensitive hashing. Her other technical interests include network and software security, artificial intelligence, knowledge extraction, and recommendation systems.
Teresa received a Master's degree in Computer Science from Pace University in 2010. While at Pace, she published a graduate thesis on MyFido, an intelligent RSS feed aggregator. MyFido uses natural language processing and other artificial intelligence techniques to make suggestions based on passive and non-passive observed user interests.
Teresa currently works for Xero, Inc.'s “Fringe” team in NYC. She lives with her cat Molly and dog Rondo.