The Future of Enterprise Analytics

In the couple of weeks since the 2016 Hadoop Summit in San Jose, eSage Group has been discussing the future of big data and enterprise analytics. Quick note – data is data, and data is produced by everything, so "big data" is really no longer an important term.

eSage Group is specifically focused on the tidal wave of sales and marketing data being collected across all channels. To name a few:

  • Websites – Across multiple sites, Clicks, Pathing, Unstructured web logs, Blogs
  • SEO – Search Engine, Keywords, Placement, URL Structure, Website Optimization
  • Digital Advertising – Format, Placement, Size, Network
  • Social
    • Facebook – Multiple pages, Format (Video, Picture, GIF), Likes (now with emojis), Comments, Shares, Events, Promoted, Platform (mobile, tablet, PC) and now Facebook Live
    • Instagram – Picture vs Video, Follows, Likes, Comments, Reposts (via 3rd Party apps), LiketoKnow.it, Hashtags, Platform
    • Twitter – Likes, RT, Quoted RT, Promoted, Hashtags, Platform
    • SnapChat – Follows, Unique views, Story completions, Screenshots. SnapChat, to say the least, is still the Wild West in terms of what brands can do to engage and ultimately drive behavior.

Then we have off-line channels (print, TV, events, etc.), partners, and 3rd party data. Don't get me started on international data.

Tired yet?


While sales and marketing organizations see the value of analytics, they are hindered by what is accessible from the agencies they work with and by the difficulty of accessing internal siloed data stored across functions within the marketing organization – this includes central corporate marketing, divisional/product groups, field marketing, product planning, market research and operations.

Marketers are hindered by limited access to the data and by the simple issue of not knowing what data is being collected. Wherever the data lies, it is often controlled by a select few people who service the marketers and don't necessarily know the value of the data they have collected. Self-service and exploration are not yet possible.

Layer on top of this the fact that agile marketing campaigns require real-time data (or at least close to real time) and accurate attribution and predictive analytics.

So, you can see there are a lot of challenges facing a marketing team, let alone the deployment of an enterprise analytics platform that can serve the whole organization.

Now that I have outlined the business challenges, let’s look at what technologies were mentioned at the 2016 Hadoop Summit that are being developed to solve some of these issues.

  • Cloud, cloud, cloud – lots of data can be sent up, then actively used or sent to cold storage, on or off prem. All the big guys have the "best" cloud platform.
  • Security – divisional and function roles, organization position, workflow
  • Self-Service tools – ease of data exploration, visualization, costs
  • Machine Learning and other predictive tools
  • Spark
  • Better technical tools to work with Hadoop, other analytics tools and data stores
  • And much more!  

Next post, we will focus on the technical challenges and tools that the eSage Group team is excited about.

Cheers! Tina

Definitions for “Big Data” – A Starting Point


Written by Rob Lawrence, eSage Group’s Strategic Relationship Manager

Will someone please tell us all, once and for all, just what in tarnation is Big Data? What is it? Where is it? Who's doing what with it? And why are they doing that? In one blog article I can maybe just scratch the surface of those questions. I might even provide some level of understanding for those curious marketers bewildered by, and attempting to make heads or tails of, the concept of Big Data. I could certainly dive deeper than that, because I've spent some time with this, done my homework, and lived Big Data. But this is a blog article, not a dissertation, so I'll keep it at a 10,000-foot view of the ever-elusive, yet intriguing, Big Data!

If you are one of the rare data scientists who has recently graduated from one of the few schools offering Big Data degrees, which makes you an expert in this field, please feel free to stop reading here – or continue on to better understand what the rest of us are, well, trying to grasp when it comes to Big Data. For the rest of us, here is my take on the whole Big Data craze:

Big Data is simply all the data available. That means, in realistic terms, all of the data one can gather about a subject from all the places data resides: data sitting in some long-forgotten enterprise software program in the basement of a large corporation, data from social media websites, website traffic data (click-throughs, pathing and such), text from blogs, even data from a sensor on a rocket ship or a bridge in Brooklyn (not sure if they're using sensor data on the Brooklyn Bridge, but they could be). Sources of data are vast, and growing. It's cheaper to store data than ever before, and we now have the computing capability to sift through it, so more data is being collected than ever – "Big" amounts of data are being stored and analyzed. There is a lot you can do with all this Big Data, but this is where it gets dicey. You can collect all kinds of data with one subject, question or problem in mind, but end up discovering (through analysis) more important information about a totally different subject, question or problem. That's why Big Data is so confusing to lots of folks just getting their hands dirty with it, and apparently also why it is so valuable to Marketers, Engineers, CEOs, The FBI, Data Geeks, and anyone else interested in edging out the competition. Let's explore some basics:

Wikipedia says: “Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of “big data” tools are being developed to handle various aspects of large quantities of data.”

The Big Data Institute says: “Big Data is a term applied to voluminous data objects that are variety in nature – structured, unstructured or a semi-structured, including sources internal or external to an organization, and generated at a high degree of velocity with an uncertainty pattern, that does not fit neatly into traditional, structured, relational data stores and requires strong sophisticated information ecosystem with high performance computing platform and analytical capabilities to capture, process, transform, discover and derive business insights and value within a reasonable elapsed time.”

So, we’ve only scratched the surface of truly understanding what Big Data is here in this blog, and really the multitude of possibilities Big Data represents has only begun to unfold to those of us using it to better understand whatever it is we’re collecting data about. I hope at a minimum by reading this you have gained a better understanding of what “Big Data” is, but moreover, a curiosity to learn more and perhaps even apply it to something you are working on. These are exciting times whether you are using data for marketing or designing a new rocket ship to explore Mars. Big things are coming, and it’s all due to Big Data!


Saffron is more than just a spice!

Last night was the 8th eSage Group co-sponsored Seattle Scalability MeetUp, hosted at WhitePages.com. There were about 130 people in attendance to hear about HBase and Saffron. Very cool stuff!! Here is the SlideShare.

Summary:

Nick Dimiduk from Hortonworks, co-author of HBase in Action, gave us a sneak peek at what's in store for developers using HBase as a backing datastore for web apps. He reviewed the standard HBase client API before going into a framework architecture that makes HBase development more like other frameworks designed for developer productivity. He then went over fundamentals like rowkey design and column family considerations, and dug into how to tap coprocessors to add functionality to apps that might otherwise be overlooked.
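For readers who haven't touched the standard client API Nick reviewed, here is a minimal sketch of a put and a get using the HBase Java client of that era (pre-1.0). The "pageviews" table, "metrics" column family, and rowkey scheme are made-up illustrations for this post, not anything from the talk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical table; assumes it was already created with a "metrics" column family.
        HTable table = new HTable(conf, "pageviews");

        // Rowkey design matters: combining a user id with a reversed timestamp
        // keeps each user's most recent activity sorted first.
        byte[] rowKey = Bytes.toBytes("user123-" + (Long.MAX_VALUE - System.currentTimeMillis()));

        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("metrics"), Bytes.toBytes("url"), Bytes.toBytes("/products/widget"));
        table.put(put);

        Get get = new Get(rowKey);
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("url"))));

        table.close();
    }
}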

Nick's Bio: Nick Dimiduk is an engineer and hacker with a respect for customer-driven products. He started using HBase before it was a thing, and co-wrote HBase in Action to share that experience. He studied Computer Science & Engineering at The Ohio State University, focusing on programming languages and artificial intelligence.

Paul Hofmann from Saffron gave a talk titled “Sense Making And Prediction Like The Human Brain.” It was an amazing presentation on machine learning and predictive analytics. Cool stuff!!

Abstract of Paul's talk: There is growing interest in automating cognitive thinking, but can machines think like humans? Associative memories learn by example like humans. We present the world's fastest triple store – Saffron Memory Base – for just-in-time machine learning. Saffron Memory Base uncovers connections, counts and context in the raw data. It builds, out of the box, a semantic graph from hybrid data sources. Saffron stores the graph and its statistics in matrices that can be queried in real time even for Big Data. Connecting the Dots: We demonstrate the power of entity rank for real-time search using the example of the London Bomber and Twitter sentiment analysis. Illuminating the Dots: We show the power of Saffron's model-free approach for pattern recognition and prediction on a couple of real-world examples, like Boeing's use case of predictive maintenance for aircraft and risk prediction at The Bill and Melinda Gates Foundation.

Paul's Bio: Dr. Paul Hofmann is an expert in AI, computer simulations and graphics. He is CTO of Saffron Technology, a Big Data predictive analytics firm named one of the top 5 coolest vendors in Enterprise Information Management by Gartner. Before joining Saffron, Paul was VP of Research at SAP Labs in Silicon Valley. He has authored two books and numerous publications. Paul received his Ph.D. in Physics at the Darmstadt University of Technology.

Make sure to save April 24th for the next Scalability MeetUp at RedFin.

eSage is co-sponsoring the Seattle Scalability MeetUp

Seattle Scalability MeetUp
The group listening to the presentation. Thank you Microsoft for hosting us!
Post-MeetUp Social sponsored by eSage. It was a pretty darn good turnout – about 35-40 people attended!

eSage is in its second month of hosting the post-MeetUp "MeetUp" for the Seattle Scalability MeetUp. It is a time when attendees can chat casually about all things Big Data and enjoy a beverage on eSage.

We are excited to be supporting the Hadoop community in Seattle in a fun way!

The Seattle Scalability MeetUp is a group of folks who use, or are interested in, scalable computing technologies – mostly Hadoop, HBase, and NoSQL platforms.

They have had attendees and speakers from Amazon, Facebook, Microsoft, Visible Technologies, Drawn to Scale, U.S. National Labs, and many more!

Meetups usually draw 75-100 attendees.

Usually they have:

  • 1 or 2 ~20 minute “Feature” presentations
  • Up to 4 "lightning talks"
  • Friendly and helpful group discussion
  • And Pizza!!
Hortonworks provided the pizza!

They are going to start rotating the location between Seattle and the Eastside.

If you would like more information or have a suggestion on a topic, email Tina at tinam (at) esagegroup (dot) com and she will pass them along to the organizers.

Starting With Hive

Raul at St. Paddy's Day Run
Raul Overa

By Raul Overa, Software Engineer

So you have Big Data stored in Hadoop and want to make it accessible to non-Java programmers? Hive lets you access your data without the need to create MapReduce jobs. It gives you a SQL-like language and takes care of translating your queries into MapReduce jobs, so if you already know SQL, you can start using Hive almost immediately.
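As a rough illustration of that SQL-like access, here is a minimal Java sketch that runs a HiveQL query over JDBC. It assumes a Hive server is listening on localhost:10000 and that a web_logs table already exists – both are hypothetical stand-ins for your own environment, not part of the Starting with Hive article:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (the HiveServer1-era driver class).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to a Hive server assumed to be running locally on the default port.
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // A SQL-like HiveQL query; Hive translates it into MapReduce jobs behind the scenes.
        ResultSet res = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getLong(2));
        }
        con.close();
    }
}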

Now, if you have Hive installed and configured, there are a couple of small steps you need to take in order to be able to extract data with SQL-like queries. Click here for the full Starting with Hive article.

Tech Talk with J’son

We are going to feature a technical article once a week by eSage's very own J'son Cannelos, Partner and Principal Architect. Check back every week for another Tech Talk with J'son. Have a question? Post a comment and he will be happy to answer.

Hadoop and a Beginning MapReduce Program

With all the hoopla about Hadoop lately, I’d like to discuss some of the components of MapReduce and how they are used to parse and process unstructured data in HDFS (Hadoop Distributed File System). Today, I will be discussing just the beginning of how a MapReduce program is built and run.

MapReduce is the primary Apache interface and programming system for processing data stored in HDFS. Unstructured data, like web logs, goes in one end, and data with more meaning (ahem, structure) comes out the other. Many details that would otherwise have to be coded and accounted for manually, such as retrieving data from the correct HDFS node and uncompressing input data, are handled behind the scenes for you so you can focus on what you really want to do with your data. Even recent Apache toolsets like Pig and Hive, which make this type of processing available to the non-Java set, translate their scripts into MapReduce behind the scenes to crunch your data.

Building a MapReduce program begins by first declaring a class that inherits from the org.apache.hadoop.conf.Configured class:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BeginMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // error handling if you are expecting a certain # of args (return -1)

        // JobConf setup goes here (see below)
        JobConf conf = new JobConf(getConf(), getClass());

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int retVal = ToolRunner.run(new BeginMapReduce(), args);
        System.exit(retVal);
    }
}

Tool is a helper interface. Along with ToolRunner, it helps ensure that all the default Hadoop arguments are used and allows you to concentrate on any custom arguments that you would like to set at runtime.

Configured is the main door into a MapReduce program and gives you access to the all-important JobConf object. This is the main configuration object for your "job". Here you will define the classes that represent the Mapper and the Reducer (and Combiner, Partitioner, et al. – more on those in another blog post). You can get a default JobConf by calling the following:

JobConf conf = new JobConf(getConf(), getClass());

If you have a lot of MapReduce programs that use the same settings, the JobConf could actually be fetched from a static helper class. Since most of our MR programs use the same input/output format classes and arguments, we have a separate jar called Commons that simply hands us a JobConf with most of the arguments already set for us.
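A minimal sketch of what that kind of shared helper might look like – the class name and format choices below are illustrative, not our actual Commons code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public final class JobConfCommons {
    private JobConfCommons() {}

    // Hands back a JobConf with the input/output formats most of our jobs share.
    public static JobConf defaultJobConf(Configuration base, Class<?> jobClass) {
        JobConf conf = new JobConf(base, jobClass);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        return conf;
    }
}

A job would then call JobConfCommons.defaultJobConf(getConf(), getClass()) instead of building the JobConf by hand.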

The JobConf object is also where you specify just where your input data resides, where you want the output data to go when finished, and what form the data is in. A basic setup:

FileInputFormat.addInputPath(conf, new Path("/data/moredata/20110404/*.log"));

FileOutputFormat.setOutputPath(conf, new Path("/data/outputs"));

The hardcoded paths I used above point to where the data is located in a typical HDFS setup. A more flexible option would be to make these command-line arguments (args[0], args[1], etc.), as sketched below. While it's not very useful, the above code will pretty much run even without specifying a Mapper and Reducer class! That's because the JobConf provides default Mapper and Reducer classes for us – IdentityMapper.class and IdentityReducer.class. More on these and how they work in a future posting.
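Putting those pieces together, here is a rough sketch of the whole class taking its paths from the command line – still leaning on the identity defaults, so it simply copies records from input to output:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BeginMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: BeginMapReduce <input path> <output path>");
            return -1;
        }

        JobConf conf = new JobConf(getConf(), getClass());

        // Input and output locations now come from the command line instead of being hardcoded.
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // No mapper or reducer is set, so the defaults (IdentityMapper / IdentityReducer)
        // pass each record straight through.
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new BeginMapReduce(), args));
    }
}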

That’s it for today. I hope this helps get you started in your exploration of Hadoop and MapReduce!

www.esagegroup.com