Posted by J Singh | Filed under Clustering, Locality Sensitive Hashing, OpenLSH, Big Data, K-Means
Join Us for a discussion on Clustering Big Data
When: Thursday June 4, noon – 1:00 pm EST.
Where: (Virtual Meeting)
Contact Us for coordinates.
Description: Approximate Nearest Neighbor methods for clustering and indexing have been actively researched ever since the K-Means algorithm was published in 1975 (and coded in FORTRAN). A recent book lists about 300 variants and related topics.
The 50th Anniversary issue of Communications of the ACM in 2008 cited two pieces of "Breakthrough Research". One was MapReduce; the other was clustering based on Locality Sensitive Hashing (LSH). Locality Sensitive Hashing works on large data sets and alleviates many of the issues seen with k-means. Want to see if a body of code bears remarkable similarity to a public GitHub repo? Want to find "similar" fragments of DNA that are common to several species? LSH will get you there faster than most other techniques.
The talk will demonstrate OpenLSH, an open source implementation of LSH we have been working on.
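To make the idea concrete, here is a minimal MinHash-plus-banding sketch in Python. It is illustrative only and is not the OpenLSH API; the toy documents and the parameter choices (100 hash functions, 20 bands of 5 rows) are assumptions made for the example.

```python
# Minimal MinHash + LSH banding sketch (illustrative, not the OpenLSH API).
import hashlib
import random
from collections import defaultdict

def minhash_signature(tokens, num_hashes=100, seed=42):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over the token set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(int(hashlib.md5(("%d:%s" % (salt, t)).encode()).hexdigest(), 16)
            for t in tokens)
        for salt in salts
    ]

def candidate_pairs(signatures, bands=20, rows=5):
    """Documents whose signatures agree on every row of at least one band
    land in the same bucket and become candidate near-duplicates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": set("locality sensitive hashing finds similar sets".split()),
    "b": set("locality sensitive hashing finds similar items".split()),
    "c": set("k-means clusters points around centroids".split()),
}
sigs = {name: minhash_signature(tokens) for name, tokens in docs.items()}
print(candidate_pairs(sigs))  # "a" and "b" almost always share a bucket; "c" does not
```

The two-step pattern (hash each item into a short signature, then bucket signatures by bands) is what lets LSH surface similar items without comparing every pair.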
Speaker Bios: Dr. J Singh is a Principal at DataThinks.org, a Cloud and Big Data consulting company. He is a frequent speaker on NoSQL, Hadoop Map/Reduce, and social media analytics. He is the originator ...
Read more | Comments
Thu 28 May 2015
Posted by J Singh | Filed under clustering, k-means, lsh, data mining, locality sensitive hashing, pattern matching, data analytics
Presentation at Pivotal IO Meetup in New York
Read more | Comments
Tue 17 March 2015
Posted by J Singh | Filed under analytics, facebook, mapreduce, elastic map reduce
There is a wealth of information tied up in social media and it is up to your Marketing organization to unlock its potential.
We recently ran a workshop on Facebook Analytics using Elastic Map/Reduce. In case you were unable to attend, here are the slides from that presentation:
Read more | Comments
Sun 13 January 2013
Posted by J Singh | Filed under HBase, Google App Engine, Hadoop
So you want to get your organization up to speed on Big Data. What next?
Your first step could be to set up a sandbox for the participants. We set up ours on Amazon EC2 because Amazon hosts a number of public data sets, as do InfoChimps and others. These data sets are convenient for initial experimentation.
It turns out that getting access to the data where you can touch it and feel it is the easy part. Christopher Miles takes an inspiring stab at the problem in his blog post All Your HBase Are Belong to Clojure.
Our approach is a bit different, using Google App Engine's Map Reduce to let us focus on the task at hand. This approach doesn't work for all situations but can shave valuable time off the learning curve in some cases.
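For a flavor of what that looks like, here is a rough sketch using the Python appengine-mapreduce library of that era. The Review model, the touch_review handler, and the job name are made up for illustration; the operation module and the DatastoreInputReader are part of that library.

```python
# Sketch of a datastore mapper with the (Python) appengine-mapreduce
# library. The Review model and touch_review handler are hypothetical.
from google.appengine.ext import db
from mapreduce import operation as op

class Review(db.Model):
    text = db.TextProperty()
    processed = db.BooleanProperty(default=False)

def touch_review(review):
    # The framework calls this once per Review entity; yielding an
    # operation tells it what to write back to the datastore.
    review.processed = True
    yield op.db.Put(review)

# mapreduce.yaml wires the handler to the datastore input reader:
#
#   mapreduce:
#   - name: TouchReviews
#     mapper:
#       input_reader: mapreduce.input_readers.DatastoreInputReader
#       handler: main.touch_review
#       params:
#       - name: entity_kind
#         default: main.Review
```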
I had occasion to develop this idea further at a recent talk to the Google App Engine group. Here are the slides from that talk:
Read more | Comments
Mon 19 March 2012
Posted by J Singh | Filed under Map Reduce, Hadoop, NoSQL
Here's a more detailed outline of my talk on March 12.

To make the talk more relevant to you: if you have a use case you'd like me to discuss, we'd love to hear about it and possibly incorporate it into the talk. Join us for ... (see the end of this post).
If you came here looking for the presentation, here it is.
Introduction
- What Hadoop is, and what it's not
- Origins and History
- Hello Hadoop, how to get started.
The Hadoop Bestiary
- Core: Hadoop Map Reduce and Hadoop Distributed File System (HDFS)
- Data Access: HBase, Pig and Hive
- Algorithms: Mahout
- Data Import: Flume, Sqoop and Nutch
The Hadoop Providers
- Apache
- Cloudera
- What to do if your data is in a database
The Hadoop Alternatives
- Amazon EMR
- Google App Engine
For those who weren't able to attend, here is the presentation:
Read more | Comments
Thu 8 March 2012
Posted by J Singh | Filed under Amazon EC2, Hadoop
This post begins a series of exercises on Hadoop and its ecosystem. Here is a rough outline of the series:
- Hello Hadoop World, including loading data into and copying results out of HDFS.
- Hadoop Programming environments with Python, Pig and Hive.
- Importing data into HDFS.
- Importing data into HBase.
- Other topics TBD.
Hello Hadoop World
We build this first environment on Amazon EC2. We begin with ami-e450a28d, an EBS-based 64-bit Ubuntu machine that's pretty bare-bones. To run Hadoop, we need at least an m1.large instance; using a micro instance is just too painful.
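To give a flavor of what this first exercise builds, here is a minimal word count for Hadoop Streaming in Python. The file name, HDFS paths, and streaming jar location below are illustrative and vary by installation.

```python
#!/usr/bin/env python
# wordcount.py -- minimal Hadoop Streaming word count (illustrative).
# Run as "wordcount.py map" for the mapper and "wordcount.py reduce"
# for the reducer.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            sys.stdout.write("%s\t1\n" % word)

def reducer():
    # Streaming sorts mapper output by key, so all counts for a given
    # word arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                sys.stdout.write("%s\t%d\n" % (current, count))
            current, count = word, 0
        count += int(value or 0)
    if current is not None:
        sys.stdout.write("%s\t%d\n" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:2] == ["map"] else reducer()

# Example submission (jar path varies by Hadoop version):
#   hadoop jar hadoop-streaming.jar \
#       -file wordcount.py \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -input /user/ubuntu/gutenberg -output /user/ubuntu/gutenberg-output
```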
For the actual exercise, we will use the excellent tutorial by Michael Noll. Michael first wrote it in 2007 and has kept it up to date. The tutorial walks you through installing Hadoop and using map/reduce to count every word in a large text. Michael's post is pretty much right on. Things to keep in mind:
Read more | Comments
Tue 28 February 2012
Posted by J Singh | Filed under ETL process, Hadoop, Big Data, Data Warehouse
Have you experienced this?
- Your data warehouse takes about 8 hours to load the cube. The cube is updated weekly; 8 hours per week is a small price to pay and everyone is happy.
- It is missing a few data elements that another group of analysts really needs. You add those and now the cube takes 9 hours to load. No big deal.
- Time passes, the business is doing well, the size of the data quadruples. It now takes 36 hours per week to load the cube.
- Time passes, some of the data elements are not needed any more but it is too hard to take them out of the process — it continues to take 36 hours.
- You add yet another group of analysts as users, plus a few more data elements; it now takes 44 hours per week!
- You get the picture… the situation gets more and more precarious over time.
We hope the dam holds!
Here's an idea from Read Write Web:
One example … was a credit card company working to implement fraud detection functionality. A traditional SQL data warehouse is more than likely already in place, and it may work well enough but without enough granularity for ...
Read more | Comments
Tue 31 January 2012
Posted by J Singh | Filed under Big Data

Curt Monash published a trio of articles about dirty data last summer, citing use cases for storing dirty data even when the signal-to-noise ratio is pretty low. He had this observation about the value of such data:
Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.
Big Data can be dirty, and the collection process must pay attention to data quality. This is different from the kind of data we are used to in a typical database, where the focus is on data integrity. The point is that even data with very low integrity holds valuable nuggets of information.
You might object: whatever happened to data integrity? The RDBMS people mastered the art of data integrity long ago, and along come these NoSQL folks who just ignore all that learning. This data is junk!
Human input is often a source of noise in the data. Big Data is used for analyzing Twitter feeds, blog posts, comment forums, chat sessions and the like, and text data can be ambiguous. It's context ...
Read more | Comments
Mon 30 January 2012
Posted by J Singh | Filed under Executive Briefing
We will be giving away one free ticket to the executive briefing. To be entered into the drawing for it, please let us know why you would like to attend and what you expect to get out of the briefing.
The winner will be drawn at random from the attendees and will be announced at the briefing. To be considered for the free ticket, please submit your response by February 26, 2012.
Read more | Comments
Mon 23 January 2012
Posted by J Singh | Filed under Executive Briefing, Big Data, NoSQL, Map/Reduce
DataThinks' Executive Briefing on Big Data, NoSQL and Data Analytics is scheduled for March 1, 2012. Signup information is below.
These are some of the topics we intend to cover in that briefing:
- What is Big Data and what problems does it solve? What opportunities does it present that weren't available before?
- These terms are often seen in articles about Big Data: NoSQL, Map Reduce, Analytics, Schema-less Databases. How do they relate to each other and how do they differ?
- Which problems are you better off solving with traditional database solutions?
- The various types of NoSQL databases and what each is appropriate for (a short sketch contrasting two of them follows this list):
- Key-value stores (e.g., Riak, Redis, Voldemort, Tokyo Cabinet)
- Document stores (e.g., MongoDB, CouchDB)
- Wide column stores (e.g., Cassandra, Hypertable)
- Graph Databases (e.g., Neo4J)
- Analysis and Visualization
- What Map/Reduce is and the opportunities it presents for your business.
- The cost/performance trade-offs of running Map/Reduce in the cloud (on Amazon EC2 or on Google) versus in-house.
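As a small preview of the key-value versus document-store distinction, here is a sketch using the redis and pymongo client libraries. It assumes local Redis and MongoDB instances, and the session and review data are made up for illustration.

```python
# Key-value vs. document store, side by side (assumes local servers).
import json
import redis                      # pip install redis
from pymongo import MongoClient   # pip install pymongo

# Key-value store: the value is an opaque blob addressed by its key.
# Lookups are very fast, but you cannot query inside the value.
kv = redis.Redis(host="localhost", port=6379)
kv.set("session:42", json.dumps({"user": "alice", "cart": ["sku-1", "sku-7"]}))
session = json.loads(kv.get("session:42"))

# Document store: the database understands the document's fields, so
# you can query and index on them.
db = MongoClient("localhost", 27017).demo
db.reviews.insert_one({"user": "alice", "stars": 4, "text": "Prompt shipping"})
good_reviews = list(db.reviews.find({"stars": {"$gte": 4}}))
```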
Any additional topics you would like to see covered? Please leave a comment.
The event is closed, but corporate briefings based on this material are available. Please contact us to arrange one.
Read more | Comments
Wed 14 December 2011
Older