Observatorio de CENATIC


Open Smart Cities II: Open Source Big Data


This article is the second post in a series of three that addresses, from the point of view of open source software, several technological areas related to Smart Cities: the Internet of Things, Cloud, Big Data, and Smart City platforms for services and applications. Here we examine the role of open source Big Data technology as a fundamental tool in Smart City projects.

Read this post in Spanish.

1. Introduction

What are the implications of Big Data for cities? How can Big Data technologies help make cities smarter? What role is open source software playing in the development of Big Data in the context of Smart Cities? The data produced by citizens, systems and objects in the city are the basic and most scalable resources available to the stakeholders of the Smart City. This large data set, called Big Data, is constantly captured through sensors and open data sources. More and more data services for city officials, utilities and citizens are becoming available, enabling efficient access to and use of big data, a necessary requirement for Smart Cities1.

2. Big Data for Intelligent Cities: a Brief Approach

Gartner defines "Big Data" as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making”2.

IBM explains that the concept of Big Data applies to all information that cannot be processed or analyzed using traditional tools and processes. Nevertheless, it is important to understand that conventional databases remain important and relevant analytical solutions. Furthermore, this large volume of information comes in a wide variety of data types, generated by sources all around the world: mobile devices, audio, video, GPS systems, and digital sensors in countless industrial machines, automobiles, electric meters, vanes, anemometers, etc., which can measure and communicate position, movement, vibration, temperature, humidity and even chemical changes in the air. Applications that analyze these data require very fast response times in order to obtain the right information at the right moment.3

Our cities are full of information, generated by heterogeneous sources in different formats, granularity, dynamism and quality. Knowledge of this complex information space is vital to create Smart City services and it is linked not only to technological issues underlying centralization, storage, processing and analysis of information, but also extends to issues such as safety and property of the data generated in the city, interoperability, etc.

3. What are Smart City data like?

Ajit Jaokar, founder of Futuretext, in his article "Big Data for Smart Cities - How do we go from Open Data to Big Data for Smart Cities"4, explains what the data generated by cities look like. Based on an article by Barry Devlin5 published on O'Reilly Strata, in which the author presents a Big Data taxonomy, Jaokar classifies the data generated in the Smart City into "hard", "soft" and "compound" data. The following picture shows each type of data in detail.

The bottom pyramid represents the city's "hard data". At the first level we have data collected mainly from the physical world, the world of matter: measurement data, data from sensors, etc. On the second level we find "atomic data", physical events meaningfully combined in the context of some human interaction. Finally, at the third level we have data created through mathematical manipulation of atomic data, generally used to obtain more meaningful information: for example, city metrics, metadata or KPIs.

On the other side, the information in the top pyramid is the realm of the mind of the Smart City. This is the information coming from human interaction in society, called "soft information", which is less well structured and requires more specialized statistical and analytical processing.

On this side, at the first level, we find what Devlin calls "multiplex data", that is, the information generated by human social interaction: for example, location data, sensor data from mobile devices, citizen reports, and tagging by citizens. At the second level we have textual data, for example Twitter messages. Finally, at the intersection of the two pyramids we find the "compound data" of the city, a combination of hard and soft information that includes linked data6, social media data and structured data. Compound data is the category of most current interest in Big Data: it contains much social media information, combining hard web log data with soft textual and multimedia data from sources such as Twitter, Facebook and so on.
To take advantage of all this information about the city and generate valuable services for its citizens, it is necessary to make Big Data small, that is, to make it accessible to citizens7. This is where the Open Data concept comes into play in relation to Big Data. Open Data are really open only if they are accessible, that is, easy to obtain and easy to understand. Therefore, the pre-processing, storage and post-processing of data enabled by Big Data technology are important issues when deploying Open Data strategies as part of a Smart City project.

In this article we do not address the Open Data paradigm (we leave this matter for another post); we simply note, according to Gartner, that if Big Data makes organizations intelligent, Open Data makes them rich1, and this fact is a great opportunity for cities, especially in this period of economic crisis.

Under the Smart City framework, data is the basic ingredient of the services provided. For this reason it is important that the Open Data strategies of cities do not give the cold shoulder to the data generated by citizens; otherwise we lose very relevant information and the opportunity to create social, environmental and economic wealth and, therefore, quality of life.

According to Andrea Di Maio of Gartner, for Smart Cities to provide value-added services based on data, it is necessary to integrate the management of data generated by social media into Open Data strategies8 and to deploy the most appropriate Big Data technologies for its treatment: extraction, standardization, storage, analysis and visualization onto structures that are easily accessible.

4. Big numbers for Big Data

A review of statistics from different sources shows the great economic, social and innovation impact underlying Big Data.

Big Data will drive $232 billion in spending through 2016 on hardware, software and related services. In 2012, $5.5 billion in new software sales will be driven directly by demand for new Big Data functionality, growing at a rate of over 16% annually through 2016. The sub-segments receiving the biggest Big Data investments are social network analysis and content analytics, with 45% of new spending each year9. This growth is creating demand for a new kind of professional, the Data Scientist10, a rising profession that goes beyond the expert in data warehousing or business intelligence.
"A Comprehensive List of Big Data Statistics"11 is an excellent collection of statistics from various sources that helps us understand the magnitude of the phenomenon we are analyzing. Some of the most significant insights are the following:
  • 2.7 Zettabytes of data exist in the digital universe today.

  • The Obama administration is investing $200 million in big data research projects.

  • IDC estimates that by 2020, business transactions on the internet (business-to-business and business-to-consumer) will reach 450 billion per day.

  • Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.

  • 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.

  • More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.

  • Decoding the human genome originally took 10 years to process; now it can be achieved in one week.

  • In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using big data to reduce fraud and errors and boost the collection of tax revenues.

  • Poor data across businesses and the government costs the U.S. economy $3.1 trillion dollars a year.

Finally, regarding Spain, we briefly note that, according to IDC, Big Data is in a nascent state, with only 4.8% of Spanish companies using these technologies in their business processes. Forecasts indicate that by 2014 its use in the country could grow by about 304%.12

So what we currently know about Big Data is just the tip of the iceberg of what Gartner says will be the "new normal". Mark Beyer of Gartner says that by 2020 the capabilities and features of Big Data will be a normal part of the product offering of traditional companies that provide IT solutions.

5. Big Data Open Source Technologies

From a technological point of view, Big Data is synonymous with technologies like Hadoop and NoSQL databases, including Mongo (document store) and Cassandra (key/value database). Open source software is key in this area. There are currently thousands of open source technologies on the market, and some of these products are revolutionizing the foundations of Big Data.

Below is a selection of the most important open source Big Data products. As in the previous article on IoT, this is not an exhaustive list, but a first approximation showing the state of the art of open source technology in the field of Big Data.

5.1 Apache Hadoop

The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop was created by Doug Cutting to support distribution for the Nutch13 search engine project.
Initially the aim was to meet Nutch's multi-machine processing requirements, for which Cutting implemented the MapReduce14 computational paradigm, in which the application is divided into many small fragments of work, each of which can be run or rerun on any node in the cluster. In addition, Hadoop provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop, which is available under the Apache 2.0 license, is currently one of the most popular Big Data technologies for storing structured, semi-structured and unstructured data. Inspired by Google's MapReduce and the Google File System (GFS), it is a top-level Apache project, written in Java, built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively in its business15.

More information: http://hadoop.apache.org/

5.2 MapReduce

MapReduce21 is a programming model for processing large data sets, and also the name of Google's implementation of the model. MapReduce is used for distributed computing on clusters of servers.

The name of the framework is inspired by the names of two important methods, or functions, in functional programming: Map and Reduce.

MapReduce has been adopted worldwide through an open source implementation called Hadoop, whose development was initially led by Yahoo (and is now led by the Apache project). MapReduce libraries have been written in many programming languages, such as C++, Java and Python.
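
The map and reduce phases described above can be imitated in a few lines of plain, single-machine Python. This is only a conceptual sketch of the programming model (the canonical word-count example), not the Hadoop API; in a real cluster the map, shuffle and reduce steps run on many nodes in parallel.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data for smart cities", "open data and big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["data"])  # the word "data" appears three times across both documents
```

Because each map call touches only one document and each reduce call only one key, both phases can be partitioned across cluster nodes, which is exactly what makes the model scale.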

More information: http://research.google.com/archive/mapreduce.html and http://www.youtube.com/watch?v=8wjvMyc01QY

5.3 Storm

Storm is a free, open source system for real-time distributed computing, born at Twitter. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Storm is simple, can be used with any programming language, and has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.
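
Storm's topology model (spouts that emit tuples, bolts that transform the streams) can be illustrated with plain Python generators. This is a single-process sketch of the idea only, not the Storm API; the sensor names and values are invented for the example.

```python
def sensor_spout():
    """Spout: a source of raw tuples (here a short finite stream of readings)."""
    for line in ["sensor 21.5", "sensor 19.0", "sensor 22.3"]:
        yield line

def parse_bolt(stream):
    """Bolt: parse each raw line into a (name, reading) tuple."""
    for line in stream:
        name, value = line.split()
        yield name, float(value)

def rolling_avg_bolt(stream):
    """Bolt: emit a running average after every tuple, as realtime analytics would."""
    total, n = 0.0, 0
    for name, value in stream:
        total, n = total + value, n + 1
        yield name, total / n

# Wire the topology: spout -> parse bolt -> rolling-average bolt
averages = list(rolling_avg_bolt(parse_bolt(sensor_spout())))
print(averages[-1])
```

In Storm each stage would run as many parallel tasks with the stream repartitioned between them; here the stages simply chain in one process.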

More information: http://storm-project.net/

5.4 Apache Kafka

Apache Kafka is a distributed publish-subscribe messaging system that offers a solution capable of handling all the data-flow activity of a consumer website and processing these data. This type of data (page views, searches and other user actions) is a key ingredient of the current social web.

Kafka was developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline.

Kafka unifies offline and online analytical processing, providing a mechanism for parallel loading into Hadoop and the ability to partition real-time consumption over a cluster of machines.
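
The core idea behind Kafka's publish-subscribe model (topics as append-only logs, with each consumer tracking its own read offset) can be sketched in a few lines of stdlib Python. This toy class is an illustration of the concept only; it is not Kafka's actual protocol or client API.

```python
from collections import defaultdict

class MiniLog:
    """Toy publish-subscribe log: each topic is an append-only list, and every
    consumer keeps its own offset into it, as Kafka consumers do."""
    def __init__(self):
        self.topics = defaultdict(list)
        self.offsets = defaultdict(int)  # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, consumer, topic):
        """Return every message this consumer has not yet seen, then advance its offset."""
        pos = self.offsets[(consumer, topic)]
        messages = self.topics[topic][pos:]
        self.offsets[(consumer, topic)] = len(self.topics[topic])
        return messages

log = MiniLog()
log.publish("page-views", "/home")
log.publish("page-views", "/search?q=big+data")
print(log.consume("analytics", "page-views"))  # both messages, in publish order
print(log.consume("analytics", "page-views"))  # [] - the offset already advanced
```

Because the log is retained rather than deleted on delivery, a second, independent consumer (say, a batch loader into Hadoop) can replay the same topic from offset zero, which is what lets Kafka serve both online and offline processing.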

More information: http://kafka.apache.org/

5.5 HBase

HBase is a non-relational database for Hadoop: an open source, distributed and scalable Big Data store. It is written in Java and implements the Bigtable16 concept developed by Google. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Apache HBase is the database to choose when we need real-time random read/write access to a large data set. The objective of Apache HBase is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
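
The Bigtable/HBase data model is essentially a map from (row key, column, timestamp) to a value, with cells kept in multiple timestamped versions. The toy class below sketches that model in stdlib Python; it is an illustration of the data model only, not the HBase API, and the row and column names are invented.

```python
from collections import defaultdict

class MiniBigtable:
    """Toy sketch of the Bigtable/HBase model: versioned cells addressed by
    (row key, 'family:qualifier' column), supporting random reads and writes."""
    def __init__(self):
        self.cells = defaultdict(list)  # (row, column) -> [(timestamp, value), ...]

    def put(self, row, column, value, ts):
        """Random write: append a new timestamped version of one cell."""
        self.cells[(row, column)].append((ts, value))

    def get(self, row, column):
        """Random read: return the most recent version of one cell, or None."""
        versions = self.cells[(row, column)]
        return max(versions)[1] if versions else None

t = MiniBigtable()
t.put("sensor#42", "env:temperature", "21.5", ts=1)
t.put("sensor#42", "env:temperature", "22.0", ts=2)  # a newer version of the same cell
print(t.get("sensor#42", "env:temperature"))  # reads the latest version: 22.0
```

In HBase proper, rows are kept sorted by key and split into regions served by different machines, which is what turns this simple model into a distributed store.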

More information: http://hbase.apache.org/

5.6 Cassandra

Apache Cassandra17 is a non-relational, distributed database management system based on a key-value model, written in Java and initially developed by Facebook. Cassandra can handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. It offers high availability, linear scalability and high performance while relaxing some consistency guarantees. Cassandra's distributed architecture is based on a series of peer nodes that communicate through a P2P protocol, maximizing redundancy.
Cassandra was developed by Facebook to power its Inbox Search functionality, but in 2010 Facebook abandoned the project in favor of HBase18. Currently Cassandra is developed by the Apache Software Foundation and is available under an Apache 2.0 license.

In early versions Cassandra used its own API to access the database. Currently, Cassandra uses the Cassandra Query Language (CQL), which has a syntax similar to SQL but with far fewer features. This makes it easier to start using Cassandra, and it also allows access from Java through JDBC.
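
The masterless, peer-node architecture described above rests on partitioning: each key is hashed onto a ring of equal peers, and copies are placed on several consecutive nodes. The sketch below is a deliberately simplified illustration of that placement idea in stdlib Python (node names and the replication factor are invented); real Cassandra uses virtual nodes and configurable replication strategies.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # equal peers, no master
REPLICATION_FACTOR = 2

def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Toy partitioner: hash the key onto the ring, then walk clockwise to pick
    rf consecutive peers as replica holders, so no single node is a point of failure."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

owners = replicas_for("sensor#42")
print(owners)  # two distinct peers hold copies of this row
```

Any peer can accept a request and forward it to the replica holders, which is how availability survives individual node failures.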

Among its many users19 we find Twitter, which uses it for its platform20; Adobe, which uses it in Adobe® AudienceManager, a Digital Marketing Suite (DMS) that consolidates, activates and optimizes data from various digitally addressable sources21; eBay, which has adopted this solution to support multiple applications, with clusters that span multiple data centers22; and the Auth Authorization Service of Ericsson Labs, which uses Cassandra as its backend database23.

More information: http://cassandra.apache.org/

5.7 Riak

Riak is a Dynamo-inspired, open source, distributed NoSQL database that also has a commercial version. It is a schema-less key-value store with some metadata, agnostic as to data type and language: through a REST API and protocol buffers24 it supports many languages (Erlang, JavaScript, Java, PHP, Python, Ruby...). It is masterless (all nodes are equal), scalable, eventually consistent, and uses map/reduce and "links"25. Riak is designed to solve a new class of data management problems, specifically those related to the capture, storage and processing of data in modern, distributed IT environments such as the cloud.
Riak can be used as a session store; as a cloud file system (like Amazon S3) storing high volumes of information and media (video, audio, etc.); as a caching layer; for powering distributed e-commerce solutions; for building scalable and dependable mobile applications; for migrating legacy RDBMS systems to the cloud; for managing sensor-based or RFID network data; and for managing user data for online social and gaming networks26.
Numerous companies use Riak27. For example, GitHub uses this NoSQL database together with Webmachine to power GitHub Pages28, a feature that lets you publish web content easily. Inagist29 uses Riak as its storage layer: this company, which analyzes Twitter content in real time, migrated from Cassandra to Riak30 when it began charging for certain features. In the UK, the cloud provider Brightbox uses Riak for various internal projects, including a centralized and searchable record store31.

More information: http://docs.basho.com/

5.8 MongoDB

MongoDB32 is an open source, NoSQL, document-oriented database system. Instead of storing data in tables as a "classical" relational database does, MongoDB stores structured data as JSON-like documents with dynamic schemas (MongoDB calls this format BSON), making the integration of data easier and faster in certain types of applications.
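
The idea of schema-free, JSON-like documents queried by example can be sketched with stdlib Python. This toy class only echoes the rough shape of a document store's insert/find operations; it is not MongoDB's actual driver API, and the sample documents are invented.

```python
import json

class MiniCollection:
    """Toy document store: schema-free JSON-like documents, queried by example."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # Round-trip through JSON to keep an independent, JSON-safe copy.
        self.docs.append(json.loads(json.dumps(doc)))

    def find(self, query):
        """Return documents whose fields match every key/value pair in the query."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

cities = MiniCollection()
cities.insert({"name": "Madrid", "smart": True, "sensors": 1200})
cities.insert({"name": "Gijon", "smart": True})  # a different shape: no schema required
print(len(cities.find({"smart": True})))  # both documents match, despite differing fields
```

Note that the two documents have different fields, yet live in the same collection and answer the same query: that flexibility is what "dynamic schemas" buys in document databases.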

10gen began development of MongoDB in October 2007.

Currently MongoDB is a production-ready database with multiple features. It can be used in many different fields: file storage, cloud infrastructure, content management, e-commerce, games, education, metadata storage, real-time statistics, social networks, etc. This database supports the storage of millions of documents and is widely used at the enterprise level33. Some of the companies using MongoDB are MTV Networks (using it in its content manager)34, Craigslist35, for archiving documents, and Disney36, as its gaming platform repository.
At the government level we find several experiences, like Gov.UK, which started out using MySQL but moved to MongoDB upon realizing how much of its content fitted MongoDB's document-centric approach37, or the National Archives in the UK, which is consolidating and unifying its numerous electronic repositories into one, using MongoDB built on a Microsoft software stack38.

More information: http://www.10gen.com/

5.9 Neo4j

Neo4j is an open-source graph database supported by Neo Technology. Neo4j stores data in nodes connected by directed, typed relationships, with properties on both, a model also known as a Property Graph. Neo4j is:

  • intuitive, using a graph model for data representation

  • reliable, with full ACID transactions

  • durable and fast, using a custom disk-based, native storage engine

  • massively scalable, up to several billion nodes/relationships/properties

  • highly-available, when distributed across multiple machines

  • expressive, with a powerful, human readable graph query language

  • fast, with a powerful traversal framework for high-speed graph queries

  • embeddable, with a few small jars

  • simple, accessible by a convenient REST interface or an object-oriented Java API
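
The Property Graph model listed above (nodes with properties, directed typed relationships, fast traversals) can be sketched in stdlib Python. This is a conceptual toy only, not Neo4j's API or its Cypher query language, and the example nodes are invented.

```python
from collections import defaultdict

class MiniGraph:
    """Toy property graph: nodes with properties, directed typed relationships,
    and a breadth-first traversal along one relationship type."""
    def __init__(self):
        self.props = {}
        self.edges = defaultdict(list)  # source node -> [(relationship type, target node)]

    def add_node(self, name, **props):
        self.props[name] = props

    def relate(self, src, rel_type, dst):
        self.edges[src].append((rel_type, dst))

    def traverse(self, start, rel_type):
        """All nodes reachable from start by following edges of one type."""
        seen, queue = set(), [start]
        while queue:
            node = queue.pop(0)
            for rtype, nxt in self.edges[node]:
                if rtype == rel_type and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

g = MiniGraph()
g.add_node("ana", role="citizen")
g.add_node("bus_12", kind="vehicle")
g.add_node("stop_4", kind="stop")
g.relate("ana", "RIDES", "bus_12")
g.relate("bus_12", "STOPS_AT", "stop_4")
print(g.traverse("ana", "RIDES"))
```

A graph database's advantage is that such traversals follow direct node-to-node links instead of the repeated joins a relational database would need for the same question.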

Neo4j is one of the leading graph databases in the world. Among its users are companies like InfoJobs, Lufthansa, Mozilla, Accenture, Cisco and Adobe39.

More information: http://www.neo4j.org/learn/neo4j

5.10 Apache CouchDB

Apache CouchDB is an open source NoSQL database that uses JSON to store data, JavaScript as its query language (using MapReduce), and HTTP for its API.

CouchDB was created in 2005 by Damien Katz as a storage system for a large-scale object database. Currently CouchDB is distributed under the Apache License 2.0 and is used by multiple organizations40, such as the BBC, which uses CouchDB for its dynamic content platform, or Credit Suisse, which uses it to store the configuration details of its Python market data framework41.

More information: http://couchdb.apache.org/

5.11 Hypertable

Hypertable is an open source database management system written in C++ and developed by Zvents. It is based on the design of Google's Bigtable. Hypertable is a high-performance, distributed, scalable, non-relational storage system that does not support transactions. It is ideal for applications that need to manage rapidly evolving data, and it is designed to withstand high demand for real-time data.

Its customers include companies such as eBay, Tiscali and Rediff.com42.

More information: http://www.hypertable.org/ .

5.12 Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

More information: http://hive.apache.org/


5.13 Cascading

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

More information: http://www.cascading.org/about/

5.14 Apache Drill

Apache Drill (an Apache Foundation incubating project) is an open source project that reduces the barriers to adopting a new set of Big Data APIs.

Apache Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Apache Drill represents a huge leap forward for organizations looking to augment their Big Data processing with interactive queries across massive data sets43.

More information: http://incubator.apache.org/drill/

5.15 Pig / Pig Latin

Pig was initially developed at Yahoo! to allow people using Hadoop to focus more on analyzing large data sets and spend less time writing mapper and reducer programs.

Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name.

Pig is made up of two components: the first is the language itself, called Pig Latin (yes, the people naming Hadoop projects do tend to have a sense of humor in their naming conventions), and the second is a runtime environment where Pig Latin programs are executed44.

More information: http://pig.apache.org/ and http://www.youtube.com/watch?v=jxt5xpMFczs

5.16 R

R is the world's leading programming language for statistical analysis and graphics45. R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S; there are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering...) and graphical techniques, and is highly extensible.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes effective data handling and storage facilities; a suite of operators for calculations on arrays, in particular matrices; a large, coherent, integrated collection of intermediate tools for data analysis; graphical facilities for data analysis and display, either on-screen or in hardcopy; and a well-developed, simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.

R is more than a set of statistical tools: it is an environment in which statistical techniques are applied, and it can easily be extended through numerous packages.

Many companies and organizations, such as Bing, Facebook, Google, The New York Times and Mozilla, have chosen R to analyze large datasets46.

More information: http://www.r-project.org/

5.17 Redis

Redis is an open source, BSD-licensed, networked, in-memory key-value data store with optional durability. It is written in ANSI C. The development of Redis is sponsored by VMware47.
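
The phrase "in-memory with optional durability" can be made concrete with a tiny stdlib Python sketch: reads and writes hit a dictionary in RAM, while an optional append-only log on disk records every write for recovery. This only echoes the design idea; it is not Redis's protocol, API or actual persistence format.

```python
class MiniRedis:
    """Toy in-memory key-value store with optional durability through an
    append-only write log, echoing (very loosely) Redis's design."""
    def __init__(self, log_path=None):
        self.data = {}            # all reads and writes are served from memory
        self.log_path = log_path  # None means pure in-memory, no persistence

    def set(self, key, value):
        self.data[key] = value
        if self.log_path:  # optional durability: append every write to disk
            with open(self.log_path, "a") as log:
                log.write(f"SET {key} {value}\n")

    def get(self, key):
        return self.data.get(key)

db = MiniRedis()  # pure in-memory instance
db.set("visits", "1024")
print(db.get("visits"))
```

After a crash, replaying the log from the top rebuilds the in-memory dictionary, which is the essential trade: memory-speed access with disk-backed recovery when durability is switched on.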

More information: http://redis.io/

5.18 HCatalog

Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:

  • Providing a shared schema and data type mechanism.
  • Providing a table abstraction so that users need not be concerned with where or how their data is stored.
  • Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.

More information: http://incubator.apache.org/hcatalog/

5.19 Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop Map/Reduce and Pig jobs. Oozie is a Java web application that runs in a Java servlet container and is distributed under the Apache License 2.0. Oozie allows users to specify, for example, that a particular query should be initiated only after previous work on certain data has been completed48.
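
The "run this only after its predecessors finish" guarantee is, at heart, a topological ordering of a dependency graph. The stdlib Python sketch below illustrates that scheduling idea with an invented workflow; it is a conceptual toy, not Oozie's XML workflow definitions or its server API.

```python
def run_workflow(jobs):
    """Toy workflow engine: jobs maps each job name to the list of jobs it
    depends on; each job runs only once all of its dependencies are done."""
    done, order = set(), []
    while len(done) < len(jobs):
        ready = [j for j, deps in jobs.items()
                 if j not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in workflow: no job is ready to run")
        for job in sorted(ready):  # deterministic order for this sketch
            order.append(job)      # a real engine would launch the Hadoop job here
            done.add(job)
    return order

workflow = {"ingest": [], "clean": ["ingest"], "aggregate": ["clean"],
            "report": ["aggregate", "clean"]}
print(run_workflow(workflow))  # ['ingest', 'clean', 'aggregate', 'report']
```

Jobs in the same `ready` batch have no mutual dependencies, so a real engine could launch them in parallel, exactly the coordination Oozie provides for chains of Map/Reduce and Pig jobs.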

More information: http://oozie.apache.org/

5.20 Talend Open Studio for Big Data

Talend provides a powerful and versatile open source big data product called Talend Open Studio that makes the job of working with big data technologies easy and helps drive and improve business performance. The product, which was released in 2006, simplifies the development of large volumes of data and facilitates the organization and instrumentation required for these projects.

Talend’s big data solutions provide a full open source solution that connects Apache Hadoop to the rest of enterprise applications, greatly benefiting data scientists in their ability to access and analyze massive amounts of data efficiently and effectively.

Talend Open Studio for Big Data is a core component of the Talend Platform for Big Data, which enables organizations to increase their productivity by deploying big data solutions in hours instead of weeks or months. The Talend Platform for Big Data is compatible with all Apache Hadoop distributions, and is a key integration component of the Hortonworks Data Platform.

Talend’s big data product has more than 450 connectors and combines big data components for Hadoop, HBase, Hive, HCatalog, Oozie, Sqoop and Pig into a unified open source environment, so you can quickly load, extract, transform and process large and diverse data sets from disparate systems.

The open source version of Talend's Big Data solution is released under the Apache license. As of January 2012, the product had recorded 20 million downloads and had about 3,500 customers worldwide.

More information: http://www.talend.com/products/talend-open-studio

5.21 Pentaho Big Data

Pentaho Business Analytics provides deep native support for the leading open source and commercial Hadoop distributions, NoSQL databases and high performance analytic databases.

Pentaho provides a full big data analytics solution that supports the entire big data analytics process from ETL and data integration to real-time analysis and big data visualization.

Pentaho's Big Data story revolves around Pentaho Data Integration, also known as Kettle. Kettle is a powerful Extraction, Transformation and Loading (ETL) engine that uses a metadata-driven approach. The Kettle engine provides data services for, and is embedded in, most of the applications within the Pentaho BI suite, from Spoon, the Kettle designer, to the Pentaho Report Designer. Check out About Kettle and Big Data for more details of the Pentaho Big Data story.

Pentaho's Big Data components are open source. In order to play well within the Hadoop open source ecosystem and make Kettle the best and most pervasive ETL engine in the Big Data space, Pentaho has put all of its Hadoop and NoSQL components into open source starting with the 4.3 release49.
Pentaho provides50:
  • Full continuity from data access to decisions – complete data integration and business analytics platform for any big data store

  • Faster development, faster runtime – visual development and distributed execution

  • Instant and interactive analysis – no coding and no ETL required

Pentaho Business Analytics lets us discover meaningful patterns buried in large volumes of data, both structured and unstructured.

Pentaho provides the right set of tools to each user, all within a tightly coupled data integration and analytics platform that supports the entire big data lifecycle. For IT and developers, Pentaho provides a complete, visual design environment to simplify and accelerate data preparation and modeling. For business users, Pentaho provides visualization and exploration of data. And for data analysts and scientists, Pentaho provides full data discovery, exploration and predictive analytics.

BeachMint, Mozilla, ExactTarget, Shareable Ink and TravelTainment are some of the organizations using Pentaho for Big Data51.

More information: http://www.pentahobigdata.com/overview

5.22 Jaspersoft Open Source for Big Data

Jaspersoft Business Intelligence for Big Data has a very advanced architecture that is independent of the data source. Jaspersoft gives us the ability to be immediately compatible with multiple Big Data solutions, such as Hadoop, MongoDB and other NoSQL or analytical databases.
Jaspersoft provides a Business Intelligence suite with reporting and ad hoc analysis tools that is intuitive and interactive. From Jaspersoft it is possible to connect in real time, through native connectors, to virtually all massive database tools at no added cost. These connectors can be downloaded for free. Jaspersoft monitors these downloads to produce its "Big Data Index", a monthly ranking of the most-used Big Data technologies that offers a very interesting view of trends in the adoption of Jaspersoft tools.
Jaspersoft has a wide network of technology partners in Big Data, such as IBM, 10gen, Cloudera, Basho and others.
More information: http://www.jaspersoft.com/bigdata

5.23 Apache Mahout

Apache Mahout is an Apache project that produces free implementations of distributed, scalable machine learning algorithms on the Hadoop platform. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.
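As a rough illustration of the kind of algorithm Mahout distributes, here is a minimal single-machine k-means clustering sketch in Python (Mahout's actual implementations are in Java and run the assignment and update steps as MapReduce jobs over Hadoop):

```python
# Minimal k-means: the kind of algorithm Mahout scales out over Hadoop.
# Single-machine sketch for illustration only; 2-D points as tuples.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                              + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, [(0, 0), (10, 10)]))
```

In Mahout the same two steps are expressed as map (assignment) and reduce (centroid update) phases, which is what makes the algorithm scale to data that does not fit on one machine.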

More information: http://mahout.apache.org/

5.24 RapidMiner

RapidMiner, formerly YALE (Yet Another Learning Environment), is open source software for data analysis and data mining. It enables the development of data analysis processes by chaining operators in a graphical environment. It is used for research, education, training, rapid prototyping, application development and industrial applications, and it is distributed under the AGPL open source license.

RapidMiner provides more than 500 operators for data analysis, including those needed for input and output, data preprocessing and visualization. It also allows the use of the algorithms included in Weka.
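The operator-chaining model can be sketched in a few lines of Python, treating each operator as a function from dataset to dataset (illustrative only; RapidMiner builds such chains visually, and the operators and data below are invented for the example):

```python
# Toy "operator chain" in the spirit of RapidMiner's process model:
# each operator takes a dataset and returns a transformed dataset.

def read_data(_):
    # Source operator: ignores its input and produces rows.
    return [{"age": 34, "income": 50000},
            {"age": None, "income": 62000},
            {"age": 45, "income": 48000}]

def drop_missing(rows):
    # Preprocessing operator: remove rows with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]

def normalize(rows, field="income"):
    # Preprocessing operator: scale a field to the 0..1 range.
    top = max(r[field] for r in rows)
    return [{**r, field: r[field] / top} for r in rows]

def run_chain(operators, data=None):
    # The "process": feed each operator's output to the next.
    for op in operators:
        data = op(data)
    return data

result = run_chain([read_data, drop_missing, normalize])
print(result)
```

RapidMiner additionally serializes such chains as XML, which is what allows the same process to be run from the GUI, the command line or another program.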

Its main features are:
  • Developed in Java, and multiplatform
  • Internal representation of data analysis processes as XML files
  • Allows the development of programs through a scripting language
  • Can be used in several ways: through a GUI, from the command line, in batch mode, or from other programs through calls to its libraries
  • Extensible, with graphics and data visualization tools included
  • Has an integration module with R

More information: http://rapid-i.com/content/view/181/190/

5.25 Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.

HDFS stores large files (an ideal file size is a multiple of 64 MB) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on the hosts. With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack.

Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant, because the requirements for a POSIX file system differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append. HDFS was designed to handle very large files.
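The rack-aware placement described above can be illustrated with a toy Python model (this is not HDFS code; the cluster layout and node names are invented for the example). The first replica goes to the writer's node, and the other two go to a different rack, so that two replicas share a rack and one sits apart:

```python
import random

# Toy model of HDFS's default 3-replica placement. Real HDFS placement
# logic lives in the NameNode and handles many more constraints.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node, cluster, rng=random):
    # First replica: the node writing the block.
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    # Second replica: a node on a different rack.
    remote_rack = rng.choice([r for r in cluster if r != writer_rack])
    second = rng.choice(cluster[remote_rack])
    # Third replica: another node on that same remote rack.
    third = rng.choice([n for n in cluster[remote_rack] if n != second])
    return [writer_node, second, third]

replicas = place_replicas("node1", CLUSTER)
print(replicas)  # e.g. ['node1', 'node5', 'node4']
```

The design trades a little write bandwidth (one cross-rack copy) for the ability to survive the loss of an entire rack.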

More information: http://hadoop.apache.org/

5.26 Gluster FS

GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients.

This allows multiple file servers to be aggregated over Ethernet or InfiniBand RDMA interconnects into one large parallel network file system. GlusterFS is designed to run in user space, without compromising performance. It can be used in a variety of environments and applications such as cloud computing, biomedical sciences and file storage.

GlusterFS is free software, with some parts licensed under the GNU GPL v3 while others are dual licensed under either GPL v2 or the LGPL v3.

GlusterFS was originally developed by Gluster, Inc., and then by Red Hat, Inc. after its purchase of Gluster in 2011.

More information: http://www.gluster.org/about/

5.27 Lucene

Apache Lucene52 is a free, open source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP. While suitable for any application that requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site search.
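At its core, Lucene is built around an inverted index: a mapping from each term to the documents that contain it. A toy Python sketch of that idea follows (real Lucene is a Java library with analyzers, relevance scoring and compressed postings lists; the documents here are invented):

```python
from collections import defaultdict

# Toy inverted index: the core data structure behind Lucene.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    # AND query: documents containing every requested term.
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "open source search engine",
    2: "distributed search platform",
    3: "open data portal",
}
index = build_index(docs)
print(sorted(search(index, "open")))            # [1, 3]
print(sorted(search(index, "open", "search")))  # [1]
```

Looking terms up in the index, rather than scanning every document, is what makes full-text search fast at scale.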

More information: http://lucene.apache.org/core/

5.28 Solr

Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
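Faceted search, one of the features listed above, means counting the matching documents per value of a field so users can drill down into results. A toy Python sketch of facet counting (Solr computes this server-side and exposes it through its HTTP API; the documents here are invented):

```python
from collections import Counter

# Toy facet count: for each value of a field, how many docs carry it.
docs = [
    {"title": "report A", "format": "pdf",  "year": 2012},
    {"title": "report B", "format": "pdf",  "year": 2013},
    {"title": "memo C",   "format": "word", "year": 2013},
]

def facet(docs, field):
    return Counter(d[field] for d in docs)

print(facet(docs, "format"))  # Counter({'pdf': 2, 'word': 1})
print(facet(docs, "year"))    # Counter({2013: 2, 2012: 1})
```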

Apache Lucene and Apache Solr are both produced by the same Apache Software Foundation development team since the two projects were merged in 2010. It is common to refer to the technology or products as Lucene/Solr or Solr/Lucene.

More information: http://lucene.apache.org/solr/

5.29 ElasticSearch

ElasticSearch is an open source, distributed, RESTful search server based on Apache Lucene. It is a scalable solution that supports real-time, multi-tenant search without special configuration. It has been adopted by several companies, including Mozilla and StumbleUpon. ElasticSearch is available under the Apache 2.0 license.

More information: http://www.elasticsearch.org/

5.30 Sqoop

Sqoop is a command-line application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase, and exports can be used to move data from Hadoop into a relational database. Sqoop became a top-level Apache project in March 2012. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop, and Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of Sqoop53.
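The incremental-load idea can be sketched with Python and SQLite as a toy stand-in (real Sqoop talks to the database over JDBC and writes to HDFS; the orders table and check column here are invented). Each run imports only rows whose key exceeds the last value seen, and records the new high-water mark for the next run:

```python
import sqlite3

# Toy "incremental import" in the spirit of Sqoop's incremental mode.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "sensor"), (2, "meter"), (3, "gateway")])

def incremental_import(conn, last_value):
    # Fetch only rows added since the previous import.
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id",
        (last_value,)).fetchall()
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last  # rows would go to HDFS; new_last is saved

first, last = incremental_import(src, 0)     # first run imports everything
src.execute("INSERT INTO orders VALUES (4, 'display')")
delta, last = incremental_import(src, last)  # later run imports only the delta
print(delta)  # [(4, 'display')]
```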

More information: http://sqoop.apache.org/


Source: ONSFA. Date: 26/03/2013

Author: Ana Trejo Pulido

1Source: http://www.future-internet.eu/home/future-internet-assembly/aalborg-may-2012/31-smart-cities-and-big-data.html

2Source: http://www.gartner.com/id=2100215

3Source: http://observatorio.cenatic.es/index.php?option=com_content&view=article&id=804:open-smart-cities-ii-open-big-data-&catid=94:tecnologia&Itemid=137#ftn3

4Source: http://www.opengardensblog.futuretext.com/wp-content/uploads/2012/08/Big-Data-for-Smart-cities-How-do-we-go-from-Open-Data-to-Big-Data-for-Smart-cities.pd

5"Will data warehousing survive the advent of big data?" Available at: http://strata.oreilly.com/2011/01/data-warehouse-big-data.html

6More information in: http://www.od4d.org/2012/10/08/datos-abiertos-y-datos-linkados/

7Source: http://www.future-internet.eu/fileadmin/documents/aalborg_documents/Report_session_3.1.pdf

8Source: http://blogs.gartner.com/andrea_dimaio/2010/02/19/why-do-governments-separate-open-data-and-social-media-strategies/

9Source: Big Data Drives Rapid Changes in Infrastructure and $232 Billion in IT Spending Through 2016. Available at: http://www.gartner.com/DisplayDocument?ref=clientFriendlyUrl&id=2195915

10More information: http://mashable.com/2012/01/13/career-of-the-future-data-scientist-infographic/

11Available at: http://wikibon.org/blog/big-data-statistics/

12Source: http://www.dataprix.com/empresa/prensa/mercado-big-data-empieza-despegar-espana

13Web site: http://nutch.apache.org/

14More information: http://es.wikipedia.org/wiki/MapReduce

15Source: http://es.wikipedia.org/wiki/Hadoop and http://www.ics.uci.edu/~lopes/teaching/inf141W13/slides/Hadoop-AWS.pdf

16More information about Big Table in: http://research.google.com/archive/bigtable.html

17Source: http://es.wikipedia.org/wiki/Apache_Cassandra

18Source: http://www.networkworld.com/slideshow/51090/#slide7

19Source: http://www.datastax.com/cassandrausers#all

20Source: http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html

21Source: http://www.youtube.com/watch?v=tVbSeNkm8QM&feature=youtu.be

22Source: http://www.datastax.com/wp-content/uploads/2012/08/C2012-BuyItNow-JayPatel.pdf

23Source: http://www.datastax.com/cassandrausers#all

24Protocol Buffers Client (PBC): http://docs.basho.com/riak/1.0.0/references/apis/protocol-buffers/

25Source: http://readwrite.com/2011/02/09/how-3-companies-are-using-nosq

26Source: http://basho.com/technology/why-use-riak/

27Source: http://basho.com/company/production-users/ and http://readwrite.com/2011/02/09/how-3-companies-are-using-nosq

28Source: https://speakerdeck.com/jnewland/github-pages-on-riak-and-webmachine

29Source: http://blog.inagist.com/riak-at-inagistcom

30Source: http://johnleach.co.uk/words/1063/riak-syslog

31Source: http://johnleach.co.uk/words/1063/riak-syslog

32Source: http://es.wikipedia.org/wiki/MongoDB

33Sources: http://www.10gen.com/customers and http://www.mongodb.org/display/DOCS/Production+Deployments

34Source: http://www.10gen.com/customers/mtv-networks

35Source: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

36Source: http://www.10gen.com/customers/disney

37Source: http://digital.cabinetoffice.gov.uk/colophon-beta/

38Source: http://www.10gen.com/presentations/mongouk-2011/from-sql-server-to-mongodb

39Source: http://www.neotechnology.com/customers/

40More information: http://wiki.apache.org/couchdb/CouchDB_in_the_wild and http://www.couchbase.com/library?type=Case+Studies

41Source: http://www.networkworld.com/slideshow/51090/#slide9

42Source: http://hypertable.com/customers/

43Source: http://www.mapr.com/support/community-resources/drill

44Source: http://www-01.ibm.com/software/data/infosphere/hadoop/pig/

45Source: http://cloudcomputing.sys-con.com/node/2325498

46Source: http://www.revolutionanalytics.com/what-is-open-source-r/companies-using-r.php

47Source: http://es.wikipedia.org/wiki/Redis

48Source: http://oozie.apache.org/

49Source: http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home

50Source: http://www.pentaho.com/big-data/

51Source: http://www.pentaho.com/customers/success-stories/

52Source: http://en.wikipedia.org/wiki/Lucene

53Source: http://en.wikipedia.org/wiki/Sqoop