
Keys to writing bad unit tests

Before I begin, let me say that I have committed all of these errors at some point. Sadly, I will probably continue to commit some of them; it's a work in progress.

At Attivio we strive to deliver high quality software, and part of our strategy involves JUnit tests. Currently we have 18,000+ tests that run as a part of our continuous integration. While many tests are straightforward (2+2=4), we do have some very complex tests that test multiple processes, threads, and their interactions.

Over time we've struggled with 'flaky tests' that fail 'sometimes'. These are even worse than the usual 'it works on my machine' scenarios, since a flaky test will likely pass on most people's machines and even on the build machine 99% of the time. Worse still, we often find that the test itself is the problem and not the underlying code. This blog post covers some of the issues we've encountered, to help us spot bad tests.

Use Thread.sleep()

If you've ever used Thread.sleep(someAmountOfTime) in a unit test, you've probably had that test fail at some point for no apparent reason. If you're running your tests in a hosted or virtualized environment, they probably fail more often. The problem with waiting 'long enough' is that any number of events outside your control can cause this sort of test to fail. For example, a garbage collection could be triggered at just the wrong time, or another process on the machine could grab all the resources, making your timing-based test useless. In a virtualized environment, your own test in another VM on the same physical machine might even be the culprit! For example, we've had code similar to this fail in our continuous build:

long start = System.currentTimeMillis();
Thread.sleep(1000);
long stop = System.currentTimeMillis();
Assert.assertTrue(5000 > stop-start);

While seeing Thread.sleep() in a test doesn't necessarily mean your test will be flaky, it does mean you should take another look at the test itself. If there isn't a while (someConditionNotMet) block around your Thread.sleep(), you should be even more cautious.
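A minimal sketch of that while-loop idea, assuming a hypothetical helper named WaitUtil.waitFor (it is not part of JUnit or of our code base): poll the condition in short sleeps and give up only after a generous timeout, so the test moves on as soon as the condition holds instead of hoping a fixed sleep was long enough.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical test helper, for illustration only.
final class WaitUtil {
  private WaitUtil() {}

  /**
   * Polls the condition until it holds or the timeout elapses.
   *
   * @return true if the condition was met before the deadline, false otherwise
   */
  static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out; the caller decides how to fail the test
      }
      try {
        Thread.sleep(pollMillis); // short sleep inside the loop, not one long guess
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

A test would then assert on the result, for example Assert.assertTrue(WaitUtil.waitFor(() -> worker.isDone(), 30000, 50)), where worker stands in for whatever you are testing. The timeout can be generous because it is only ever paid in full when the test is going to fail anyway.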

Use competing threads

Pop quiz: What is the order of the output for this program?

// t1
new Thread() {
  public void run() {
    doSomethingThatTakes30seconds();
    System.out.println("t1 done");
  }
}.start();
// t2
new Thread() {
  public void run() {
    sleepFor10seconds();
    System.out.println("t2 done");
  }
}.start();
// t3
new Thread() {
  public void run() {
    // do nothing
    System.out.println("t3 done");
  }
}.start();

If you said the output was:

t3 done
t2 done
t1 done

You'd be right, most of the time. If you said

t1 done
t2 done
t3 done

You would also be right, but very rarely. Using threads and relying on them to 'start doing stuff' when you call start() is dangerous. Any time you're using threads, you need to make sure you're using something from java.util.concurrent, using synchronization, or simply not checking any state until you've called join() on all threads, and then make sure the threads don't affect each other. You also need to stop and think carefully about your code and test code, as this stuff is almost never as straightforward as it seems. In many of our tests, the logic and data structures used for testing a block of multithreaded code are far more complex than the actual code itself.
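As a rough sketch of what "something from java.util.concurrent" can look like for the quiz above, here is one way to make the t3, t2, t1 order guaranteed rather than merely likely. The class and method names are made up for illustration. Each thread waits on a CountDownLatch that the previous one releases, and nothing is inspected until every thread has been joined.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical example class, for illustration only.
final class OrderedThreads {
  private OrderedThreads() {}

  /** Runs three threads whose output order is enforced by latches, not by timing. */
  static List<String> runInOrder() {
    List<String> output = new CopyOnWriteArrayList<>(); // safe to append from several threads
    CountDownLatch t3Done = new CountDownLatch(1);
    CountDownLatch t2Done = new CountDownLatch(1);

    Thread t1 = new Thread(() -> { await(t2Done); output.add("t1 done"); });
    Thread t2 = new Thread(() -> { await(t3Done); output.add("t2 done"); t2Done.countDown(); });
    Thread t3 = new Thread(() -> { output.add("t3 done"); t3Done.countDown(); });

    // Start order no longer matters; the latches decide who goes first.
    t1.start();
    t2.start();
    t3.start();

    // Do not look at any state until every thread has finished.
    join(t1);
    join(t2);
    join(t3);
    return output;
  }

  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static void join(Thread thread) {
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

Note how much scaffolding this takes for three one-line threads, which is exactly the point made above about test code being more complex than the code under test.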

Hard coded sockets/file paths

If you've ever created a socket server or read from a local file system outside your source tree, you've probably had tests fail randomly on someone's machine. They failed because there was another process running on that port (maybe another test session) or because you keep your test data in c:\mydata instead of c:\somedata. Either way, your test is brittle and can be affected by issues that have nothing to do with the code. At Attivio, we use a system property for a base port that defaults to a standard value. Anyone running tests locally gets the default port value, but the continuous integration builds set the system property to a different value for each plan to avoid conflicts. Further, any files we go after are either in the source tree or pulled from Maven by the build system. This also helps us enforce good coding practices, so that nobody accidentally hard codes a default configuration value or file path into production code.
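The base port idea can be sketched in a few lines. The property name test.baseport and the default of 17000 are assumptions made up for this example, not the names we actually use. Tests ask for a port by offset instead of hard coding a number.

```java
// Hypothetical helper, for illustration only.
final class TestPorts {
  /** Assumed property name; a CI plan would pass e.g. -Dtest.baseport=18000. */
  static final String BASE_PORT_PROPERTY = "test.baseport";

  /** Assumed default used when the property is not set (local runs). */
  static final int DEFAULT_BASE_PORT = 17000;

  private TestPorts() {}

  /** Returns the configured base port, or the default if none is configured. */
  static int basePort() {
    return Integer.getInteger(BASE_PORT_PROPERTY, DEFAULT_BASE_PORT);
  }

  /** Returns the port a test should use for the given offset from the base port. */
  static int port(int offset) {
    return basePort() + offset;
  }
}
```

With this in place, a test opens its server socket on TestPorts.port(1) rather than on a literal, and two build plans on the same machine can run side by side by setting different base ports.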

Pull in a lot more functionality than you need

The easiest way to ensure that your unit tests take a long time to run and are brittle in the face of upstream or downstream changes is to pull unrelated functionality into your tests and then rely on that logic for your test to succeed.

If you have a component that moves data from Point-A to Point-B, make sure your test only tests that. Don't add in Point C that wraps A to B logic and is easy to test, but not a part of the code you are currently working with. Since our system is built on top of an ESB, we often see developers want to put their component into a workflow in a running system and then feed content through it and the rest of the default workflows to make sure it can perform its job.

The worst case of this is pulling in a larger framework than you need to test your code. For example, you don't need to load a Spring context to test your bean. Even if it implements Spring lifecycle interfaces like InitializingBean, you can still call afterPropertiesSet() yourself and avoid the Spring overhead.
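Here is a small sketch of that approach. To keep it compilable without Spring on the classpath, it declares a stand-in interface with the same single method as Spring's InitializingBean; in real code the bean would implement the Spring interface itself. GreetingBean is a made-up bean for illustration.

```java
// Stand-in for org.springframework.beans.factory.InitializingBean (assumption:
// used here only so the sketch compiles without Spring).
interface InitializingBeanLike {
  void afterPropertiesSet() throws Exception;
}

// Hypothetical bean, for illustration only.
final class GreetingBean implements InitializingBeanLike {
  private String name;
  private String greeting;

  void setName(String name) {
    this.name = name;
  }

  /** Lifecycle hook a container would call after injecting properties. */
  @Override
  public void afterPropertiesSet() {
    if (name == null) {
      throw new IllegalStateException("name must be set");
    }
    greeting = "Hello, " + name;
  }

  String getGreeting() {
    return greeting;
  }
}
```

The test then does by hand what the container would do: create the bean, call the setters, call afterPropertiesSet(), and assert on the result. No application context is loaded, so the test starts in milliseconds.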

Conclusion

I'm sure there are many other bad practices, but the ones I've listed above represent some of our worst offenders. As I said in the intro though, everyone makes these mistakes. The key is being able to track why a test is failing so you can start working on a fix.

One of our colleagues at Attivio has a niece and nephew who are as fluent in Japanese as they are in English. Their mom is Japanese and their dad is American, so they have a completely bilingual household. In one moment they might talk to each other in English, their Mom or Dad might call to them from another room in Japanese, and they will answer in kind, switching between their two languages as easily as switching TV channels.

Unified information access (UIA) technology is a lot like being strongly bilingual, in that UIA also quickly and easily communicates information that spans different worlds: specifically, structured data (databases) and unstructured content (documents/text), whether from internal or external sources.

Just as our colleague's niece and nephew can communicate as easily with anyone when visiting Japan as they can at home, a true UIA platform can also freely communicate with disparate information sources and with other applications; particularly BI tools, self-service dashboards and analytic systems. Doing so requires supporting the widely-used SQL (Structured Query Language) standard, via ODBC/JDBC connectivity.

Background

Much energy and effort has gone into the production of tools and technologies to analyze data. From iPhone and iPad apps to spreadsheets, reporting tools, "self-service" dashboards, various analytic systems, right on up to full-blown ad-hoc drag & drop BI tools, we live in an era where everything is analyzed, and the tools we use for that analysis actually contribute to better decisions. One of the keys to the interoperability of this huge ecosystem is a standards-based approach: the broad use of the Structured Query Language (SQL) is the reason the ecosystem exists.

The downside to many of these tools is that they operate only on so-called structured data — until recently, ignoring valuable context contained in unstructured sources. Without an integrated and fully correlated view of the complete picture, organizations will miss out on a much wider world of business insights and understanding; not unlike relying on a really bad language interpreter (poor Bill Murray!):

 

Unified Information Access

Fortunately, Attivio’s Active Intelligence Engine (AIE) gives you “the best of both worlds.” Because AIE supports querying in SQL via ODBC and JDBC, organizations can use it to explore all information regardless of source or format. By deploying AIE as a back-end unified information source, your users can continue to use BI and other tools they are comfortable with, but now with the added ability to access a far more complete picture of the business, for more informed decisions and a deeper understanding that is simply not possible when working with structured or unstructured information alone.

Attivio-TibcoSpotfire Screenshot

One key to making this happen: Active Intelligence SQL (AI-SQL) - a set of full-text function extensions to SQL. AI-SQL functions make it easy for SQL query authors to incorporate AIE's unique features including operations like:

  • REGEX — find rows based on a pattern in a field.
  • STARTSWITH — find rows in which a given field starts with a specified string.
  • ENDSWITH — find rows in which a given field ends with a specified string.
  • NEAR — find rows based on two or more terms being within a specific number of words of each other.
  • ONEAR — find rows based on two or more terms being within a specific number of words of each other in the order they are specified to the function.
  • FULLTEXTSEARCH — apply a simple query language query as a filter to a specified field.

The FULLTEXTSEARCH function supports blended search and analytic user interfaces: an application can plug the contents of a user's search box into a SQL query, letting the user interact with the data in a controlled, simple way.

Some examples of using AI-SQL extensions:

select r_regionkey,r_name from all_tables where r_name = regex('e.*e')
select p_partkey,p_container from all_tables where p_container = startswith('brown')
select p_partkey,p_container from all_tables where p_container = endswith('car')
select p_partkey,p_container from all_tables where p_container = near('brown','car','set(distance=1)')
select p_partkey,p_container from all_tables where p_container = onear('brown','car','set(distance=1)')

select company.company, company.ticker, count(news.newsarticleid)
FROM company INNER JOIN news ON company.ticker = news.ticker
WHERE news.content=fulltextsearch(?UserSearch)
GROUP BY company.company, company.ticker

The Importance of JOIN

It should be noted that it is no easy feat to support a detailed query language like SQL, while also serving as a UIA platform that can ingest and search across countless information sources at massive scale.

Some SQL capabilities, like a single field GROUP BY, are easily accommodated by unstructured search, and are relatively straightforward to handle.  For example, given the following query:

SELECT name,count(*) FROM customers GROUP BY name


This can be easily issued as an AIE facet query, asking for facet values and counts on the name field:

query-request> table:customers
facet-request>
name


However, other SQL features, like JOIN, are among the most difficult to support; but happily, they are handled by AIE’s patented ability to dynamically JOIN data and content without advance data modeling on an unstructured index. These advanced AIE capabilities allow us to execute all of the queries of the TPC-H benchmark, including TPC-H "Query 3" joining three tables and aggregating the results:

SELECT l_orderkey,  
SUM(l_extendedprice*(1-l_discount)) as revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
ORDER BY revenue DESC,
o_orderdate

Partnering for Success

While AIE supports a wide range of SQL, as with any backend system, some SQL operations will not be available. According to a leading analyst, no relational database actually supports the full ANSI-92 standard.

BI vendors recognize this and provide mechanisms to tune the SQL that a tool tries to issue to a specific backend. To help our customers in ensuring success with a given BI tool, Attivio has implemented a BI vendor certification program, which currently includes Tableau, TIBCO Spotfire and QlikView.

This program certifies tools into two levels:
  • Gold — The BI tool is certified to work with AIE using standard SQL
  • Platinum — The BI tool is certified to work with AIE using standard SQL, but also supports the use of AI-SQL extension functions.

Opening up the flexibility of AIE’s universal index to support a powerful industry standard language like SQL via ODBC and JDBC, enabling BI tools and scores of other compatible applications to access and analyze unified information, has proven to be a compelling value proposition for organizations looking for new opportunities to build revenue, cut costs and/or increase competitiveness. Clearly, AIE also speaks the universal language of business success.

Learn more about AIE’s support for SQL via ODBC and JDBC, as well as AIE’s Query-Time JOIN with no data modeling required.

Attivio, like many other companies, uses open source where appropriate. The Java community in particular has great open source technologies powering some of the hottest technology trends today, from big data (Hadoop and MapReduce) to columnar storage (MonetDB) to IDE frameworks (Eclipse).

There's also a trove of infrastructure technologies out there for logging, dependency injection, character set normalization, etc. Knowing when to use open source and when to write your own code is where the real value comes in as a developer and, more importantly, can save you and the company a great deal of time and resources.

All that being said, there are often bugs and functional gaps in open source code and as a developer you need to have a system in place that allows you to handle these issues. For example, we may find a bug in a particular package and submit a patch but since we don't control the release cycle of the open source projects we can't simply wait around for the fix. We are also very picky about less critical run-time issues such as threads being left around after a process or unit test finishes. Many open source projects assume that they are going to be run as a standalone server that terminates with the JVM and these sorts of assumptions can break or disturb our unit test environments.

We've recently switched to the vendor branching methodology described here against our Subversion repository. This allows us to import external projects into our revision control system.

When we encounter an issue in open source code, we:

  • Check in a copy of the source code for the class(es) we plan to fix or enhance. It's important to have this copy so we can easily compare and merge new revisions of the upstream project with our own modifications.
  • Make changes to the class(es), annotating them with //attivio start mod and //attivio end mod so that a quick scan of the file picks out our changes. We also put a reference to the internal developer ticket in the code, so later reviewers can easily trace those changes when upgrading upstream code.
  • Compile and build these changes into a JAR file that contains not only the class files, but also the source files in order to comply with many open source licenses. Generally speaking, there is no intellectual property in this code so we don't have to worry about anyone reviewing our changes.
  • Deploy this JAR to our internal Maven repository with an Attivio specific revision. We then update our top level POM (Project Object Model) to reference this Attivio version instead of the public Maven repository version.

Lastly, we strive to create a formal ticket, patch and test for contribution back to the open source community. In all fairness, this is our weakest part of the process, but one we are striving to improve. Many of our changes are small in nature and fix either esoteric edge cases or general code cleanliness like the thread example I mentioned above, but most changes are still useful for the wider community.

Related:

Enterprise Strategy Group Report - Today's Information Access Requirements Outpace Open Source Search Options

Thinking About Replacing Your Search Engine or Search Appliance?

Software at the Speed of Light

I recently attended the YES Boston panel on Big Data & Analytics at the Harvard Innovation Center. Overall it was an excellent discussion. At least one panelist indicated that "It feels like 1995 again", referring to that heady period when the Internet emerged and drove the dot-com era forward. Most of the focus was on the superb opportunities that Big Data creates for entrepreneurs. A few panelists also suggested that Big Data would lead to the death of the traditional relational database and data warehouse.

Earlier in the discussion, one panelist characterized Big Data as having "three V's": Volume, Velocity and Variety. This has become the standard way to segment the various use cases that collectively add up to "Big Data", alongside other often-cited characteristics like value and complexity. However, another panelist then said that Big Data was mostly about unstructured information. I have previously written about how not all unstructured data is the same. Most of what is called "unstructured" in the context of Big Data is really data (variable-length log files and the like), not truly unstructured CONTENT, which includes articles, web pages, documents, email, etc. It is important to understand the difference, especially from the entrepreneur's perspective. Much of the "volume" in Big Data is unstructured DATA (again, mostly log files) with very low individual value. It is only when analyzed in volume that it becomes interesting and valuable.

One of the more interesting questions at the end of the panel came from Wikibon's Dave Vellante. He asked, more or less, "Why are big data and the new technologies that are emerging to analyze it going to be disruptive to the enterprise data warehouse?"

Here, the panelists' answers seemed uncertain. Several spoke about the challenge of getting centralized IT to produce new information from legacy BI tools. While this is probably impossible to argue with, at least one panelist went too far, saying that data will just be kept in new systems and BI tools will work directly against them. I didn't buy this angle, and in a follow-up conversation with Dave, he agreed with me that the panel mostly missed the mark.

I would have answered Dave's question like this: the key with Big Data is to take the volume of low-value items and turn it into high-value analysis. That analysis then needs to be co-mingled with other information that has high item value. This includes email, documents, text in applications, rows in databases, ERP, CRM, etc. That isn't disruptive to the eDW in and of itself. Most big data, like behavioral information, won't be interesting to typical corporate decision makers. A few data scientists will analyze click streams and use them to optimize the end user experience. The transactions (hopefully sales) that result from that will go into the eDW. The yield improvements will also be analyzed and tracked over time, again probably by traditional BI tools.

The best example I can give of a real world case is from Attivio's IT Knowledge Expert solution. ITKE analyzes log files from operating systems and applications to identify events that are interesting and/or problematic. For example, we may drop or summarize informative messages, keep warnings, and correlate errors. This is great because it helps system administrators quickly discover the symptoms of an issue. However, it is the other data — the high value articles, knowledge bases, SharePoint articles created by previous admins, etc., in which the solution to the problem is found. This is why I refer to content as high-value. It explains WHY things happen.

Related:

Missing Some Key Points with Big Data

Extreme Information: Completing the Big Data Picture

Attivio IT Knowledge Expert

Introduction

Modularity is a hallmark of good application development practice. When attempting to rapidly implement a web-based application, basing the front end on one or more REST APIs is a common choice. A REST API is simple to use and provides for easy client-side bookmarking. Since most of our customers develop some type of custom web UI for AIE-based applications, we've made it simple to build robust, testable REST APIs.

Building a REST API in AIE requires extending PlatformComponent and overriding two methods. Testing the API involves a few lines of code and use of shipped test framework classes (the same ones we use to test AIE). In the example below, I create a REST API that reports the free memory and maximum memory of the hosting JVM.

A simple REST API

Creating an AIE service that provides a REST API is as simple as extending the PlatformComponent class and overriding a couple of methods.

Override convertCgiRequest

Whenever a service that exposes an HTTP endpoint is accessed with a GET request, the request is converted by AIE into a CgiRequest message. CgiRequest messages contain all of the parameters and headers of the request. The convertCgiRequest method is a hook for the developer to translate the request into a message that the REST service can handle. This can be as simple as translating the request into a StringMessage with the relevant data, but I prefer to use simple inner classes to make requests concrete instead of opaque (this also makes testing easier). Returning null from this method indicates the CGI request is not recognized and the service should return nothing. Alternatively, you can throw an exception that will result in an error getting returned to the caller.

Sample GET request

http://localhost:17001/memtracker?mem=true

Overridden class

/** {@inheritDoc} */
@Override
protected PlatformMessage convertCgiRequest(CgiRequest cgi) throws AttivioException {
  String mem = cgi.getCgiParameter("mem");
  if (mem != null) {
    return new MemRequestMessage(); // concrete private message class
  } else {
    return null; // no valid request found... alternatively could throw an exception
  }
}

The inner message class

public static class MemRequestMessage extends AbstractPlatformMessage {}

Override handleMessage

handleMessage()

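/** {@inheritDoc} */
@Override
protected PlatformMessage handleMessage(MessageContext context, PlatformMessage msg) throws AttivioException {
  if (msg instanceof MemRequestMessage) {
    return new CgiResponse() {
      @Override
      public void writeResponse(HttpServletResponse resp) throws IOException {
        // writes "<free memory>,<max memory>" in bytes for the hosting JVM
        resp.getWriter().format("%d,%d", Runtime.getRuntime().freeMemory(), Runtime.getRuntime().maxMemory());
      }
    };
  }
  // The original snippet had no return on this path and would not compile.
  // Returning null for unrecognized messages is an assumption.
  return null;
}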

Testing

As has been pointed out here, here and here, at Attivio we really believe in testing. We support that philosophy by making it as easy (fast to write/fast to run) to test functionality as possible. To that end, we have developed supporting classes for testing PlatformComponents (including services) without having to start a full AIE system. With the AIE SDK, we ship those supporting classes so that our customers can test easily as well.

Testing the Service

@Test
public void cgiTest() throws AttivioException, IOException {
  MemoryApiService srv = new MemoryApiService();
  TransformerTestUtils.startTransformer(srv); // mock-up system stuff and start the service

  CgiRequest req = new CgiRequest(null);
  req.setCgiParameters("mem=", IOUtils.DEFAULT_ENCODING); // provide the full cgi query string we are testing, use UTF-8 encoding
  CgiResponse respMessage = (CgiResponse) srv.onCall(null, req); // execute the request
  MockHttpServletResponse mockServletResponse = new MockHttpServletResponse();
  respMessage.writeResponse(mockServletResponse);
  String[] results = mockServletResponse.getWrittenResponse().split(",");
  Assert.assertTrue(Long.parseLong(results[0]) < Long.parseLong(results[1]));
  Assert.assertTrue(Long.parseLong(results[0]) > 0);
}

The MockHttpServletResponse that is available in AIE 3.1 is a very simple class. Below is a partial implementation for your use if needed.

/**
 * A mocked version of a HttpServletResponse
 */
public class MockHttpServletResponse implements HttpServletResponse {
  private final ByteArrayOutputStream bos = new ByteArrayOutputStream();
  private final PrintWriter writer = new PrintWriter(bos, true);

  /**
   * @return the response written so far
   */
  public String getWrittenResponse() {
    return bos.toString();
  }

  /** {@inheritDoc} */
  @Override
  public PrintWriter getWriter() throws IOException {
    return writer;
  }

  //// rest of default (unchanged from Eclipse auto-generation) implementation omitted for brevity.
}
