Monday, June 6, 2011

Consume before you Implement

Don't write any code before the code that consumes it has been written!

Let's assume that we have a project to import a flat file with a list of customers and addresses into a database table.  A "traditional" approach to solving this simple task might involve the following 3 steps:


  1. Think through the problem - design the solution.
  2. Write code
  3. Test the code

Seems simple enough - and on a project of this size, it might look like a good idea to an experienced developer.  In Step 1 they can plan out exactly how they are going to solve this problem.  Are they going to use ADO.NET, or are they going to use Linq-To-SQL, or maybe Data Tables & Data Adapters?  Are they going to try to find a 3rd party tool to read and parse the flat file, or are they going to do it themselves?

Once they've answered these high level questions they might get to work - write the code.  Once it's written, press Run! and see if everything works as they expect.  In my experience, this approach (while appearing logical on the surface) is flawed in a variety of ways.  9 out of 10 times, the code won't work as expected, and what follows during Step 3 (which was originally conceived of as a "testing" phase) is a process of bug fixing, redesigning, re-implementing, and reworking the original "design".  If the developer originally thought that the whole project would take 5 hours, they are usually only starting to "test" the project by the 5 hour mark, and often end up spending another 50-200% of that time debugging the code written in Step 2.

Now - the premise of this post is "Don't write any code before the code that consumes it has been written".  Let me explain A) what I mean by this and B) how it changes the approach described above.

Instead of the 3 steps originally considered above, I would propose an iterative approach involving the following 6 steps, repeated again and again until the project is completed.  When this approach is used (in my experience), the following benefits are realized:

  • 5 hour projects are actually completed in 5 (or fewer) hours.
  • Bugs are often avoided altogether, and when they are encountered, are easier to fix/resolve/understand.
  • Estimates of how much time remains on a bug are more accurate.
  • Estimates of how long additional features will take to implement are more accurate.
  • The final product is better designed, more stable and easier to understand.
Let me explain:
First of all, rather than the original steps listed above, I would start with the following plan:
  1. Think through the problem (as described above) and come up with an overall design.
  2. Before writing any code, break the problem down into a series of high level steps.
  3. Think through each step and decide how each step will be completed, and how that implementation relates to the overall design.  This process is recursive back to step 1.  What this means is that each major step is broken down into a series of high level steps.  Each step is then broken down into a series of smaller steps.  Etc.
  4. Write the code for the high level design (with the simplest possible implementation of each major step)
  5. Test the high level design.
  6. Rinse and repeat steps 1 through 5 for each of the high level steps.  Then, repeat for each of their sub-steps, etc., until the entire project has been implemented, down to the lowest bit/byte.
Let's look at how this would work in the example above:
Before writing the first line of code, the following "plan" would be created (and possibly documented):

High Level Design:
  1. Read the source file
  2. Open a connection to the database
  3. For each line of the file, insert a record into the database.
  4. Close the file & DB connection.
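
As a rough sketch, the four steps above could each be encapsulated in its own method.  The class and method names below are illustrative placeholders (not part of the project), and the record-insert step is deliberately left empty - its "simplest possible implementation" comes later:

```csharp
using System;
using System.IO;
using System.Data.SqlClient;

// Hypothetical skeleton mapping the four high level steps onto methods.
public class ImportPlan {
  public void Run(String fileName, String connectionString) {
    String[] lines = ReadSourceFile(fileName);              // Step 1
    SqlConnection conn = OpenConnection(connectionString);  // Step 2
    foreach (String line in lines) {
      InsertRecord(conn, line);                             // Step 3
    }
    CloseConnection(conn);                                  // Step 4
  }

  private String[] ReadSourceFile(String fileName) {
    return File.ReadAllLines(fileName);
  }

  private SqlConnection OpenConnection(String connectionString) {
    var conn = new SqlConnection(connectionString);
    conn.Open();
    return conn;
  }

  private void InsertRecord(SqlConnection conn, String line) {
    // Placeholder - the real insert is designed in a later iteration.
  }

  private void CloseConnection(SqlConnection conn) {
    conn.Close();
  }
}
```
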
Now - with this high level design (nicely organized and encapsulated) - let's think through each step:
The questions asked/answered in the "traditional" approach are now each relevant to the individual steps.  For Step 1, we need to decide whether we are going to use a 3rd party CSV parser, a StreamReader, a simple File.ReadAllLines() call, or write our own CSV parser.  In Step 2 we can decide whether to use ADO.NET, Linq-To-SQL, DataTables, etc.  Steps 3 and 4 will be influenced by the decisions made in Steps 1 and 2.

Having thought through this overall design, let's start with our iterative implementation for this project (which is the KEY to this whole thing).  Let's say that we're planning to write our own CSV parser, and use ADO.NET for the database work.  This is the code I would write to start with (some pseudo code here):

public class SampleImporterProject {
  public SampleImporterProject(String fileName, String dbConnectionString) {
    this.FileName = fileName;
    this.DBConnectionString = dbConnectionString;
  }

  // Properties
  private String FileName { get; set; }
  private String DBConnectionString { get; set; }

  // Import Method
  public void Import() {
    // Open the source & destination data sources
    String[] lines = File.ReadAllLines(this.FileName);
    SqlConnection conn = new SqlConnection(this.DBConnectionString);
    conn.Open();

    // Remove the existing customers
    SqlCommand cmd = new SqlCommand("DELETE FROM Customers", conn);
    cmd.ExecuteNonQuery();

    // Re-create the customers
    cmd = new SqlCommand("INSERT INTO Customers (Name) VALUES (@Name)", conn);
    cmd.Parameters.AddWithValue("@Name", "new customer");
    foreach (String line in lines) {
      cmd.Parameters["@Name"].Value = "new customer";
      cmd.ExecuteNonQuery();
    }
    conn.Close();
  }
}

At this point, this class would be plugged into the project.  The "expected" behavior at this point would be that it should be able to open the source CSV file, open the destination DB, empty the destination, and create one new customer with the name "new customer" for each line in the CSV file.  The things that this class will not do comprise a substantially longer list:
  • Any sort of error handling
  • Actually move any "real" data from source to destination
  • Handle a database that has foreign key constraints on the Customers table (all customers are deleted to begin with).
  • Handle field names in the CSV file.
  • Handle processing of actual CSV fields
  • Handle different file types (CSV, TSV, XLS, etc.).
  • Etc.
The list goes on and on.  This is by design.  The process I'm describing is specifically designed to handle this.
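
Plugging the class in for this first test run might look something like the console harness below.  The file path and connection string are placeholders I've made up for illustration - substitute whatever values your environment uses:

```csharp
using System;

class Program {
  static void Main() {
    // Placeholder path and connection string - not from the original project.
    var importer = new SampleImporterProject(
      @"C:\data\customers.csv",
      "Server=.;Database=SampleDb;Integrated Security=true;");

    // Run the simplest possible end-to-end import and observe the results.
    importer.Import();
  }
}
```
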

What we do get at this point is that we should be able to run this project, and we should be able to validate that the application is going to work like we expect.  This testing process is critical, because a number of possible bugs are eliminated at this point.  Before we have hundreds or even thousands of lines of code that could be affected, we validate the following elements of our design:
  • We have permission to read the file.
  • By stepping through our code, we can get a "first" look at the actual "lines" of the CSV file.
  • We can see and connect to the database.  All connection data is available, and will exist outside of our class.
Let's assume that in running our first tests, we find (or validate) that we can't simply delete all of the customers, but rather, have to find existing customers (By ID) and only insert new ones if the customer doesn't already exist.  The import method could be updated as follows: