Saffron is more than just a spice!

Last night was the 8th eSage Group co-sponsored Seattle Scalability MeetUp, hosted at WhitePages.com. There were about 130 people in attendance to hear about HBase and Saffron. Very cool stuff!! Here is the SlideShare.

Summary:

Nick Dimiduk from Hortonworks, co-author of HBase in Action, gave us a sneak peek at what’s in store for developers using HBase as a backing datastore for web apps. He reviewed the standard HBase client API before going into a framework architecture that makes HBase development more like other frameworks designed for developer productivity. He then went over fundamentals like rowkey design and column family considerations, and also dug into how to tap coprocessors to add functionality to apps that might otherwise be overlooked.

Nick’s Bio: Nick Dimiduk is an engineer and hacker with a respect for customer-driven products. He started using HBase before it was a thing, and co-wrote HBase in Action to share that experience. He studied Computer Science & Engineering at The Ohio State University, focusing on programming languages and artificial intelligence.

Paul Hofmann from Saffron gave a talk titled “Sense Making And Prediction Like The Human Brain.” It was an amazing presentation on machine learning and predictive analytics. Cool stuff!!

Abstract of Paul’s talk: There is growing interest in automating cognitive thinking, but can machines think like humans? Associative memories learn by example, like humans. We present the world’s fastest triple store, Saffron Memory Base, for just-in-time machine learning. Saffron Memory Base uncovers connections, counts, and context in the raw data. It builds a semantic graph out of the box from hybrid data sources. Saffron stores the graph and its statistics in matrices that can be queried in real time, even for Big Data. Connecting the Dots: We demonstrate the power of entity rank for real-time search with the example of the London Bomber and Twitter sentiment analysis. Illuminating the Dots: We show the power of Saffron’s model-free approach for pattern recognition and prediction on a couple of real-world examples, like Boeing’s use case of predictive maintenance for aircraft and risk prediction at The Bill and Melinda Gates Foundation.

Paul’s Bio: Dr. Paul Hofmann is an expert in AI, computer simulations, and graphics. He is CTO of Saffron Technology, a Big Data predictive analytics firm named one of the top 5 coolest vendors in Enterprise Information Management by Gartner. Before joining Saffron, Paul was VP of Research at SAP Labs in Silicon Valley. He has authored two books and numerous publications. Paul received his Ph.D. in Physics at the Darmstadt University of Technology.

Make sure to mark April 24th on your calendar for the next Scalability MeetUp, hosted at RedFin.

Tech Talk Thursday – Remove Table References in Hive ORDER BY clause

By J’son Cannelos – Partner / Principal Architect, eSage Group

“In God we trust; all others pay cash.”  

– Bob French, New Orleans Tuxedo Jazz Musician (1938 – 2012)

This fairly simple Hive issue was driving me nuts for a while, so I wanted to get it out to the blog while it’s still fresh in my mind.

Take the following innocent Hive query:

select distinct s.date_local, s.user_id from slice_played s where LENGTH(s.user_id) > 0 and s.date_local >= '2012-10-07' and s.date_local <= '2012-10-08' order by s.date_local desc limit 150;

Time and time again this would return:

Error in semantic analysis. Invalid table alias or column reference s

After removing each piece of the query in turn, it turned out that the culprit was the ORDER BY clause. This piece, it seems, is the illegal part:

order by s.date_local

Why, you ask? Because, apparently, Hive doesn’t allow table references in the ORDER BY clause! Ack!

The solution is pretty simple, but not intuitive. You need to either a) remove the table reference from the fields in your ORDER BY clause, or b) alias the columns you would like to order by. Here is the corrected Hive query (using option b) that works:

select distinct s.date_local as date_pacific, s.user_id from slice_played s where LENGTH(s.user_id) > 0 and s.date_local >= '2012-10-07' and s.date_local <= '2012-10-08' order by date_pacific desc limit 150;

I’ve fallen into this trap several times now. In our Hive implementation, we pretty much force strict mode (hive.mapred.mode = strict), so we have to alias tables, use existing partitions in the WHERE clause, etc.
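As a rough illustration of what strict mode forces on us (assuming, just for this example, that slice_played is partitioned by date_local):

set hive.mapred.mode = strict;
-- Strict mode requires a filter on a partition column for partitioned tables,
-- and an ORDER BY must always be paired with a LIMIT.
select distinct s.user_id
from slice_played s
where s.date_local = '2012-10-07'
order by user_id desc
limit 100;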

According to this JIRA link (https://issues.apache.org/jira/browse/HIVE-1449), it’s a known issue. It just says that table references are a no-no, so you don’t really need to alias your columns; however, column aliases seem safer to me, since I could just as easily be joining several tables that each have a “date_local” column.
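For completeness, here is the same query using option a), with the table prefix simply dropped in the ORDER BY clause:

select distinct s.date_local, s.user_id from slice_played s where LENGTH(s.user_id) > 0 and s.date_local >= '2012-10-07' and s.date_local <= '2012-10-08' order by date_local desc limit 150;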

Hope this helps and happy coding!
Sincerely,

J’son

Tech Talk Thursday – SSRS ReportViewer Chart Caching Issue – Resolved!

I’ve wanted to write about this for some time, as it was a pretty tough challenge.

First, some background: a couple of years ago, I built a custom SSRS Parameter Viewer Control to replace the one that ships with the ASP.NET ReportViewer control. The default prompt area of the ReportViewer was not very visually appealing, and selection changes to the parameters themselves often caused full postbacks! I understand that’s been fixed now with SSRS 2008/R2, but it is still lacking in text lookup and other features that I have since added to my custom control.

This is a pretty long post, so I PDF’d it:  SSRS ReportView Chart Caching Issue – Resolved!

eSage is co-sponsoring the Seattle Scalability MeetUp

Seattle Scalability MeetUp
The group listening to the presentation. Thank you Microsoft for hosting us!
Post MeetUp Social sponsored by eSage. It was a pretty darn good turnout. About 35-40 people attended!

eSage is in its second month of hosting the post-MeetUp “MeetUp” for the Seattle Scalability MeetUp. It is a time when attendees can chat casually about all things Big Data and enjoy a beverage on eSage.

We are excited to be supporting the Hadoop community in Seattle in a fun way!

The Seattle Scalability MeetUp is a group of folks who use or are interested in scalable computing technologies, mostly Hadoop, HBase, and NoSQL platforms.

They have had attendees and speakers from Amazon, Facebook, Microsoft, Visible Technologies, Drawn to Scale, U.S. National Labs, and many more!

Groups are usually 75-100 attendees.

Usually they have:

  • 1 or 2 ~20 minute “Feature” presentations
  • Up to 4 “lightning talks”
  • Friendly and helpful group discussion
  • And Pizza!!
Hortonworks provided the pizza!

They are going to start rotating the location between Seattle and the Eastside.

If you would like more information or have a suggestion on a topic, email Tina at tinam (at) esagegroup (dot) com and she will pass them along to the organizers.

Starting With Hive

Raul at St. Paddy's Day Run
Raul Overa

By Raul Overa, Software Engineer

So you have Big Data stored in Hadoop and want to make it accessible to non-Java programmers? Hive lets you access your data without the need to write MapReduce jobs. It lets you query your data with a SQL-like language and takes care of translating it into MapReduce jobs, so if you already know SQL, you can start using Hive almost immediately.
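As a quick illustration, a query like the one below (the table and column names are made up for the example) looks like plain SQL, but Hive runs it as one or more MapReduce jobs behind the scenes:

select page, count(*) as hits
from web_logs
where log_date = '2012-10-07'
group by page;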

Now if you have Hive installed and configured, there are a couple of small steps you need to take in order to be able to extract data with SQL-like queries. Click here for the full Starting with Hive article.

Thursday Tech Talk with J’son

Setting Up Excel Services in SharePoint 2010 for Testing

May 1, 2012
By J’son Cannelos, Partner / Principal Architect

Microsoft SharePoint 2010 is here to stay. According to global360.com, 67% of companies participating in a recent survey reported deploying SharePoint in an enterprise environment. Managing document workflows and attaching content to business processes were the top reasons given for using SharePoint. At the same time, Microsoft Office Excel is the de-facto tool for data analysis and enablement. Everyone from the company CEO on down to the accountant uses Excel today for examining their present situation and forecasting the future. Could web-enabling Excel via SharePoint be far behind?

Excel Services has been around since SharePoint 2007; however it’s made a big leap in SharePoint 2010. A good write up on Excel Services for SharePoint 2010 is located here (yep, you even get Slicers!):

http://blogs.office.com/b/microsoft-excel/archive/2009/11/11/excel-services-in-sharepoint-2010-dashboard-improvements.aspx

The service allows Excel spreadsheets to be presented in a web browser using a slick Excel-like interface. External data connections, workbook calculations, user defined functions, and charts are supported out of the box for a true near desktop experience. Business stakeholders love and need Excel? Check. They need to access Excel workbooks and reports anytime, anywhere, even on a computer without Microsoft Office? Check!

Continue reading here.

Tech Talk with J’son

We are going to feature a technical article once a week by eSage’s very own, J’son Cannelos, Partner and Principal Architect. Check back every week for another Tech Talk with J’son. Have a question? Post a comment and he will be happy to answer.


Hadoop and a Beginning MapReduce Program

With all the hoopla about Hadoop lately, I’d like to discuss some of the components of MapReduce and how they are used to parse and process unstructured data in HDFS (Hadoop Distributed File System). Today, I will be discussing just the beginning of how a MapReduce program is built and run.

MapReduce is the primary Apache interface and programming system for processing data stored in HDFS. Unstructured data, like web logs, go in one end and data with more meaning (ahem, structure), come out the other. Many details that would have to be coded and accounted for manually, such as retrieving data from the correct HDFS node and uncompressing input data, are handled behind the scenes for you so you can focus on what you really want to do with your data. Even recent Apache toolsets like Pig and Hive, which make this type of processing available to the non-Java set, translate their scripts into MapReduce behind the scenes to crunch your data.

Building a MapReduce program begins by first declaring a class that inherits from the org.apache.hadoop.conf.Configured class:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BeginMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // error handling if you are expecting a certain # of args (return -1)

        // JobConf setup below
        JobConf conf = new JobConf(getConf(), getClass());

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int retVal = ToolRunner.run(new BeginMapReduce(), args);
        System.exit(retVal);
    }
}

Tool is a helper interface. Along with ToolRunner, it helps ensure that all the default Hadoop arguments are used and allows you to concentrate on any custom arguments that you would like to set at runtime.

Configured is the main door into a MapReduce program and gives you access to the all-important JobConf object. This is the main configuration object for your “job”. Here you will define the classes that represent the Mapper and the Reducer (and Combiner, Partitioner, and so on – more on those in another blog post). You can get a default JobConf by calling the following:

JobConf conf = new JobConf(getConf(), getClass());
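With a JobConf in hand, this is also where you point the job at your own Mapper and Reducer implementations. A minimal sketch, assuming hypothetical MyMapper and MyReducer classes (Text and IntWritable come from org.apache.hadoop.io):

conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
conf.setOutputKeyClass(Text.class);          // key type the job writes out
conf.setOutputValueClass(IntWritable.class); // value type the job writes out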

If you had a lot of MapReduce programs that use the same settings, getConf() could actually be fetched from a static class. Since most of our MR programs use the same input / output format classes and arguments, we have a separate jar called Commons that simply hands us a JobConf with most of the arguments set for us.
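As a rough sketch of that idea (the class and method names here are purely illustrative, not our actual Commons code), such a helper might look something like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public final class JobConfFactory {

    // Hands back a JobConf with the settings most of our jobs share.
    public static JobConf createDefaultJobConf(Configuration conf, Class<?> jobClass) {
        JobConf job = new JobConf(conf, jobClass);
        job.setInputFormat(TextInputFormat.class);    // plain text log files in
        job.setOutputFormat(TextOutputFormat.class);  // text key/value pairs out
        return job;
    }
}

run() would then call JobConfFactory.createDefaultJobConf(getConf(), getClass()) instead of building the JobConf from scratch.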

The JobConf object is also where you specify just where your input data resides, where you want the output data to go when finished, and what form the data is in. A basic setup:

FileInputFormat.addInputPath(conf, new Path("/data/moredata/20110404/*.log"));

FileOutputFormat.setOutputPath(conf, new Path("/data/outputs"));

The hardcoded paths I used above point to where the data is located in a typical HDFS setup. A more flexible option would be to make these variables (arguments) that are passed in from the command line (args[0], args[1], etc.). While it’s not very useful, the above code will pretty much run even without specifying a Mapper and Reducer class! That’s because the JobConf falls back to default Mapper and Reducer classes for us – IdentityMapper.class and IdentityReducer.class. More on these and how they work in a future posting.
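Here is a minimal sketch of that more flexible, command-line version (the usage message and argument positions are just one way to do it); this fragment would sit inside run(), right after the JobConf is created:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;

// Expect the input and output paths as the first two command-line arguments.
if (args.length < 2) {
    System.err.println("Usage: BeginMapReduce <input path> <output path>");
    return -1;
}
FileInputFormat.addInputPath(conf, new Path(args[0]));    // e.g. /data/moredata/20110404/*.log
FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. /data/outputs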

That’s it for today. I hope this helps get you started in your exploration of Hadoop and MapReduce!

www.esagegroup.com