Apache Mahout is a library of code you might want to use in an analysis or search application. The complexity is enough to make you wait for a finished application using the library instead of writing your own. Unfortunately finished applications are rarely finished when first released and are oversold on capability. You need to understand the benefit and the limitations of the underlying technology before committing to use the results from any application.
Apache Mahout is so early in the software development cycle that it is barely beta software, and you use it at your own risk. Some functions have had significant testing and some are brand new. Every release will make some functions mature while others will be fresh out of the oven. In fact, some will be fresh out of the mixing bowl before going in the oven. You have to look at the testing of every function before use.
If you are working on a Mahout based project, you will know the history of each item in the Mahout library. I tried to work out the status and testing of some individual functions and it was all too hard. The best I could find was a division between
I started looking at Mahout for use in a content management system (CMS) because the CMS has an interface module. The interface module does not document what Mahout does. Mahout provides documentation, then warns that some functions are new with little testing, so I tried to find out more. I did not find anything obvious and, for my purpose, would have to treat everything as untested.
Fuzzy logic was a fashionable term a few years ago and now people hate fuzzy logic because it did not do what software salespeople promised. Sometimes it produced errors. Apache Mahout offers a range of functions including some that are best described as fuzzy logic. Fuzzy logic means "we will guess the result if there is nothing obvious."
The ideal choice of search would find exactly what you want and, if there is not an exact match, give you the closest result. Fuzzy logic chooses any result that is a quick fit. Fuzzy logic may ignore an exact match because it does not enforce the type of rules required to find an exact match. Fuzzy logic may return a guess without telling you it is only a guess. You cannot trust the results of a fuzzy logic search or classification system. You have to have an additional validation or feedback process to check the results.
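The behaviour described above, an exact match first and a clearly labelled guess otherwise, can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout code: the `search` function, the `SearchResult` type, and the `min_score` threshold are hypothetical names chosen for the example, and the similarity measure is the standard library's `difflib.SequenceMatcher`, swapped in purely for convenience.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class SearchResult:
    value: Optional[str]  # the matched entry, or None when nothing is close enough
    exact: bool           # True only when the query equals an entry (ignoring case)
    score: float          # 1.0 for an exact match, otherwise a similarity in [0, 1]


def search(query: str, catalogue: list[str], min_score: float = 0.6) -> SearchResult:
    """Prefer an exact match; otherwise return the closest entry, labelled as a guess."""
    normalised = query.strip().lower()

    # Step 1: enforce the exact-match rule before any guessing happens.
    for entry in catalogue:
        if entry.lower() == normalised:
            return SearchResult(entry, exact=True, score=1.0)

    # Step 2: fall back to the most similar entry and keep its score.
    best, best_score = None, 0.0
    for entry in catalogue:
        score = SequenceMatcher(None, normalised, entry.lower()).ratio()
        if score > best_score:
            best, best_score = entry, score

    # Step 3: refuse to guess when even the best candidate is a poor fit.
    if best_score < min_score:
        return SearchResult(None, exact=False, score=best_score)
    return SearchResult(best, exact=False, score=best_score)
```

The caller can check `exact` and `score` before trusting the answer, which is the validation step a bare fuzzy result does not give you.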
Apache Mahout library functions have to be carefully analysed to understand the accuracy and validity of their results when applied to your data.
Apache Mahout is written in Java because Apache Mahout is designed to work with other software written in Java. When you have enough experience of using Java, you realise you have to extensively test Java based applications with different data and in different environments to make sure they will not crash. The cost of supporting Java is 50% or more higher than the alternatives. Clearly you would have to get a lot of benefit from the Apache Mahout library functions to cover the support costs.
Mixing Java Web software with other Web software creates additional installation and maintenance problems. Some of the interface modules for Java choose to leave Java in a separate server and communicate using Web services. If you use PHP, for example, Mahout can be accessed directly because PHP can call Java directly. PHP is equally good at calling Web services. You can put Mahout on a separate Web server focused on Java and maintained by a Java expert.
You can get packages that wrap the Apache Mahout library in a framework or Web service for use from applications written in mainstream programming languages. If you are not a Java programmer, separating the Java code from your Web site makes more sense than trying to maintain a mixture of code in two languages.
An interface framework gives you another advantage. The data you use in your code can stay in the same format all the way through your code. The framework can perform data conversions between your code and Mahout or Web services.
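As a rough sketch of what such a conversion layer does, the example below flattens application records into comma-separated user, item, preference rows, the layout Mahout's file-based data model is documented to read, and turns results back into application records. The field names (`customer_id`, `product_id`, `rating`) and the `itemID,score` result layout are assumptions made for this illustration, not part of any particular framework.

```python
import csv
import io


def to_preference_csv(orders: list[dict]) -> str:
    """Flatten application order records into userID,itemID,preference rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for order in orders:
        # Field names are hypothetical; a real framework would map its own schema.
        writer.writerow([order["customer_id"], order["product_id"], float(order["rating"])])
    return buffer.getvalue()


def from_recommendations(lines: list[str]) -> list[dict]:
    """Turn 'itemID,score' lines (an assumed result layout) back into application records."""
    results = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        results.append({"product_id": int(row[0]), "score": float(row[1])})
    return results
```

Your own code only ever sees its own record format; the two functions are the only place that knows what the analysis side expects.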
Drupal is the world's most popular content management system for new Web sites bigger than a blog. Drupal is an example of an existing application connecting to Apache Mahout. First, Drupal will not depend on Apache Mahout. Apache Mahout will only be added as part of an optional add-on module and you are free to choose alternatives. Shopping carts and similar applications will have the chance to recommend exact matches, based on an understanding of the product range, before resorting to Apache Mahout.
As an example, a Web site selling disk drives knows that rotation speed is important when selecting a disk drive and disks have a small number of rotation speeds with 7200 being the most common fast disk. A disk drive shop can offer exact selection of rotation speed before resorting to a recommendation that may be imprecise.
Second, Drupal will use Apache Mahout through the Recommender module which is clearly designed to recommend a close match, not an exact match. A recommendation might answer the question "fastest cheapest disk drive" by listing cheap disk drives ordered from fastest to slowest. You can choose the recommendation or continue browsing. The important thing is the bit stating what the recommendation is based on.
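The disk drive example can be sketched as two separate paths: an exact selection by rotation speed, and a recommendation that carries a statement of what it is based on. This is plain Python for illustration only, not the Drupal Recommender module or Mahout; the `Drive` and `Recommendation` types, the function names, and the `max_price` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    rpm: int
    price: float


@dataclass
class Recommendation:
    drives: list[Drive]
    basis: str  # plain-language statement of how the list was produced


def select_by_rpm(drives: list[Drive], rpm: int) -> list[Drive]:
    """Exact selection: only drives with precisely the requested rotation speed."""
    return [d for d in drives if d.rpm == rpm]


def recommend_fast_cheap(drives: list[Drive], max_price: float) -> Recommendation:
    """Imprecise answer to 'fastest cheapest': cheap drives, fastest first."""
    affordable = [d for d in drives if d.price <= max_price]
    # Fastest first; among equal speeds, the cheaper drive comes first.
    ranked = sorted(affordable, key=lambda d: (-d.rpm, d.price))
    basis = (
        f"Drives priced at {max_price:.2f} or less, "
        "ordered from fastest to slowest rotation speed"
    )
    return Recommendation(ranked, basis)
```

Showing `basis` next to the list lets the shopper see that the result is one interpretation of the question, not an exact answer.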
Apache Mahout is used for data mining. A lot of functions reduce data for classification then analysis and reporting. A social Web site might classify people by country and gender to feed data into a marketing campaign and to decide what content should be used. What they might not realise is the number of families that share one login. They might not realise the number of people from the Middle East who log in using the husband's id because there is a local belief that women should not communicate freely with the open world. They might not realise the number of men in the western world who do not use their own id when logging in because they work in the military and would be fired for expressing an independent opinion.
Most data is categorised then presented as if the classification and categorisation is accurate. When results are presented that are obviously wrong, the error is often a deliberate attempt to make the results fit a marketing campaign instead of the other way round. Adobe want to sell products to create Flash files. Adobe tell you that 99% of Internet users use Flash. The Adobe figures are not based on Internet users. Instead the Adobe figures are based on a survey of Flash users. Apache Mahout results are only as good as the data plus the user's understanding of the data.
There are lots of things you can do with Mahout. I mentioned a sales example. The most common use of recommendation software is to direct a customer to a product. A customer logs into your shop and browses notebook cases. Your software can find a previous sale of a notebook to the same customer and feature notebook cases designed for the size of notebook your customer previously purchased.
You do not need Mahout for something this simple. A typical sales system looks through the notebook brands to find the one with the biggest profit margin then looks through the range from that brand for the ones recommended by the manufacturer to work with that model notebook.
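The simple rule just described needs no analysis library at all. A minimal sketch in plain Python follows; the record layout (`name`, `brand`, `fits`) and the function name are hypothetical, and moving on to the next most profitable brand when the first has nothing compatible is an assumption added for the example.

```python
def recommend_cases(
    previous_model: str,
    cases: list[dict],
    brand_margin: dict[str, float],
) -> list[str]:
    """Pick the highest-margin brand that has cases listed for the customer's notebook."""
    # Walk the brands from highest to lowest profit margin.
    for brand in sorted(brand_margin, key=brand_margin.get, reverse=True):
        matches = [
            case["name"]
            for case in cases
            if case["brand"] == brand and previous_model in case["fits"]
        ]
        if matches:
            return matches
    return []  # no brand lists a case for this notebook model


# Hypothetical catalogue: 'fits' holds the notebook models the manufacturer recommends.
cases = [
    {"name": "Slim 13", "brand": "Acme", "fits": ["N13"]},
    {"name": "Tough 15", "brand": "Acme", "fits": ["N15"]},
    {"name": "Budget 13", "brand": "Basic", "fits": ["N13"]},
]
margins = {"Acme": 0.35, "Basic": 0.20}
featured = recommend_cases("N13", cases, margins)
```

Every step here is an exact rule you can read and audit, which is why a library only earns its place once the decision gets harder.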
When would you use Mahout or an equivalent? When the decision becomes complicated. Your customer might have purchased several different notebook computers or none. You have to use other selection criteria. You start using age, gender, country, city, previous purchases by brand, price ranges, anything that might influence their decision.
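One common way to combine many weak signals is to compare each customer with similar customers and suggest what those customers liked. The sketch below is a small user-based collaborative filter using cosine similarity, written in plain Python to show the idea. It is not Mahout's implementation, the function names are hypothetical, and it uses only item ratings; attributes such as age or country would need to be encoded as further numeric features.

```python
from math import sqrt


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(
    target: str,
    ratings: dict[str, dict[str, float]],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Predict ratings for items the target has not rated, weighted by customer similarity."""
    own = ratings[target]
    scores: dict[str, float] = {}
    weights: dict[str, float] = {}
    for other, theirs in ratings.items():
        if other == target:
            continue
        similarity = cosine(own, theirs)
        if similarity <= 0:
            continue  # nothing in common, so this customer tells us nothing
        for item, rating in theirs.items():
            if item in own:
                continue  # only suggest items the target has not rated
            scores[item] = scores.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    predicted = [(item, scores[item] / weights[item]) for item in scores]
    # Highest predicted rating first; ties broken by item name for stable output.
    return sorted(predicted, key=lambda pair: (-pair[1], pair[0]))[:top_n]
```

The output is a prediction, not an exact match, so it should be presented and checked as a guess.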
Now make the analysis more complicated. You want to advertise a special offer. What product would best benefit your sales from a price reduction? Now you have to find products that can sell in volume to your existing customers and are held back only because of a price slightly higher than they will pay.
You might analyse every sales record from day one when you set up your business. You might analyse from the day you expanded to multiple brands. Adding a large analysis library to analyse your data gives you more choices and makes some things easier.
There is still the problem of deciding which tool you will use. You can insert a screw using a hammer but there might be a better tool in the toolbox.
A big problem with large volumes of data is finding the right data. Google is one of the best engineered search engines but it often produces results that are close to useless. You see all sorts of problems. Google will put out-of-date information ahead of current information because the old Web pages have accumulated more links. Google puts quotes in blogs ahead of the original source because blogs have better keyword density.
There are some really easy ways to fix Google for many common types of search. Google does not offer the option to input critical factors. If they used Mahout, Mahout would let them put the critical factors in but Mahout will not tell them what the critical factors are. Neither does the documentation at the Mahout Web site.
You have to go back to your data and understand what the data means. You can then propose tests to prove the meaning. Mahout might provide the best code for the analysis in your test.
Forget Apache Mahout by itself. Look instead for pre-built connections into your applications that use Apache Mahout. They will give you the best benefit in the shortest time. When you do find a quick way into Apache Mahout, trace backwards to find the exact functions used and the exact reduction performed on your data, so you know whether you are getting an exact result or just a guess.