Fix the Problem XIII – The Problem Toggle

(Author’s note: The “L” key on my computer – old one, new one finally ordered – was not working, and I missed it while posting this.  Apologies for the multiple revisions getting all the Ls into place in the title and link…)

While I was at Ford Motor Company, our group’s manager found a book that he thought was so good he recommended it to everyone, and even paid for our individual copies.  This book, Manufacturing Solutions for Consistent Quality and Reliability, was excellent, and it made one fundamental point about problem solving – whether in performance of a product or a process – that has stuck with me ever since (paraphrased):

You can only claim to have truly solved the problem when you understand it well enough to turn it on and then back off again.

With this in mind I’m going to review two instances from my career where this happened.  (I will revisit this theme in future essays to discuss more case studies.)

Plasma Cutting Torch: Mysterious Leaks in the Field

The torch in question was part of a plasma cutting system; I was the Manufacturing Engineer in charge of the torch area, and identifying the root cause of field failures was one of my responsibilities.  This torch had a significant rate of return; these returns, each answered by sending a new torch, were a large part of my department’s total warranty cost.

The issue was that the problem could not be duplicated.  We would receive a torch which, according to the customer, had coolant leaking out the front “business end”.  No matter what we did, we couldn’t get the returned torches to leak in our lab.

Lesson One: Unless you can duplicate the failure, in order to experiment with what does and does not trigger it, you are flailing in the dark.

So one day we get a torch back where we’ve been told the coolant is gushing out in a waterfall.  We go to our own test machine, connect everything up, turn the coolant flow on… and nothing (like always).  We exchange parts of the system for new ones.  Nothing.  Go back to the original, returned pieces.  Nothing.  Try as we might with permutations of new parts and returned parts, we cannot duplicate the reported failure.

Let me take a moment to say that I trust intuition.  Something was nagging at the back of my mind.  I couldn’t even quantify it, but something was bothering me.  Just as the technician was about to turn off the coolant flow, I said “Hold it, I want to try something.”  I reached over to the torch, grabbed the retaining cap that threaded on and which held everything inside, and started to turn and loosen it.  I had barely touched it when coolant started to jet out of the opening.

I tightened it.  Nothing.  I started to turn it, and again, coolant flowed copiously.  AHA!  I tightened it to its hard-stop and made a mark.  Then I slowly started to loosen it until, like a floodgate opening, the flow started again.  I marked that too.  It took maybe 0.25” of travel, as measured on the outside diameter, to make this difference.

Armed with this information I went to the prints and calculated that – going from memory – the axial translation of the cap being unscrewed was on the order of 0.020”.  This information was passed to the Design group, which found that the issue was design-related (details omitted for confidentiality reasons).  The designs of a few key components were tweaked and prototypes made.
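The conversion from a mark-to-mark distance on the outside diameter to axial travel can be sketched as follows.  This is only an illustration of the geometry: the function name, the cap diameter, and the thread lead are hypothetical, chosen to land in the same ballpark as the remembered figure, and are not the actual torch dimensions.

```python
import math


def axial_travel(arc_length: float, outside_diameter: float, thread_lead: float) -> float:
    """Axial movement of a threaded cap for a rotation measured as arc length on its OD.

    All three arguments must use the same length unit (inches here).
    thread_lead is the axial advance per full turn.
    """
    if outside_diameter <= 0:
        raise ValueError("outside_diameter must be positive")
    # Fraction of a full turn represented by the arc between the two marks.
    turns = arc_length / (math.pi * outside_diameter)
    return turns * thread_lead


# Hypothetical values, NOT taken from the actual prints.
travel = axial_travel(arc_length=0.25, outside_diameter=1.0, thread_lead=0.25)
print(f"{travel:.3f} in")  # prints "0.020 in"
```

The point of the exercise is that a rotation small enough to go unnoticed by hand can still translate into a measurable axial gap, which is a number a Design group can work with.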

The result: The redesigned torch wouldn’t leak even with the cap backed off twice as far as I had backed it off.  After the change, warranty returns started to drop dramatically as the redesigned torch was propagated into the customer base.

Lesson Two: Intuition and hunches are often based on a subconscious stew of disparate facts coming together.  While you shouldn’t just go with them – a systematic approach like an 8D Problem Solving Process is needed – don’t ignore those tickles at the back of your mind.

One other thing to note.  In retrospect these symptoms and the “no problem found” status made sense.  Plasma cutting is often a dirty environment with grit, metal chips, etc., around.  Likely what happened is that people took off the cap while changing the consumables – the “razor blades” – and put it down in such a way that grit got onto the face that was supposed to seat flush against the hard stop when the cap was installed.  This created an inadvertent shim that coupled with the design issue to create the leak.  In the process of being shipped to us the grit would fall off, removing that inadvertent shim and resulting in a torch that would function as intended with no problem.

Lesson Three: When troubleshooting, think about the environment where the failure is occurring.  Ideally, go and watch.  There’s nothing like seeing the precise situation for generating data, even if that data only goes into the aforementioned subconscious.

O-ring Rollout: Leak Failures in Manual Assembly Area

At the same plant where I was first given this book, one of the products was a carbon canister assembly that fitted into the fuel cell.  The canister’s function was to absorb gasoline vapors coming off the fuel tank for emissions control; some versions had several hoses with male attachments that would be inserted into a female port.  The work time standard was strict and people pushed hard to meet it.  (Note that I have a portfolio page about this problem.)

At issue was the fact that the O-rings forming the seal at these ports would roll out of the groove they were in, creating a leak path.  This assembly defect was internal to the female port and so was not visible.  The first indication there was an O-ring rollout was that the unit would fail the leak test.  The unit would then be methodically disassembled until the rollout was found.  Then it was reassembled after reseating the O-ring, and retested.  As you might imagine, this was quite time-consuming (not to mention not value-added).

Having been asked to look into the problem, one of my first actions was to look at the Design Guidelines.  Since my Master’s Research was in Design for Manufacturing and Assembly, I had – stated immodestly! – a pretty good grasp of the dynamics of how things go together.  One thing I noticed was the design of the lead-in.  Although “perfect” from a molding standpoint, having a radius as a lead-in was not so great from an assembly standpoint.  The reason is that the O-ring needed to slide along the surface without being “grabbed” by friction.  The governing equation is:

tan(angle) < coefficient of static friction

With a visual:

[Image: angle diagram]

Note that in no case will an angle greater than 45 degrees work.

What I found, in a detailed examination, was that it was possible to misalign the male insert sufficiently so that the O-ring would hit on the part of the radius where static friction would dominate.  This would then “grab” the O-ring and, as the insertion progressed, it would roll out of the groove.

In a Design for Assembly analysis I’d written before on my blog I referenced Fitts’s Law.  I applied it in this case and found that it was a difficult task for a person to do reliably – which explained the high rework rate.  By redesigning the lead-in as a 30 degree chamfer, as shown in the portfolio page (referenced again for convenience), I made it essentially impossible to NOT get the O-ring on a sliding surface.  (NB: a 15 degree angle is my “perfect” recommendation for this situation.)
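For readers unfamiliar with Fitts’s Law, here is a minimal sketch of its index of difficulty, using the commonly cited Shannon formulation, log2(D / W + 1).  The article does not say which formulation or which dimensions were used in the original analysis, so the function name, the reach distance, and the two alignment windows below are assumptions for illustration only.

```python
import math


def index_of_difficulty(distance: float, target_width: float) -> float:
    """Fitts's index of difficulty in bits (Shannon formulation).

    distance is how far the hand travels to the target; target_width is the
    size of the window the hand must land in. Both use the same length unit.
    """
    if distance <= 0 or target_width <= 0:
        raise ValueError("distance and target_width must be positive")
    return math.log2(distance / target_width + 1.0)


# Hypothetical numbers: the same reach, but a lead-in that tolerates
# more misalignment presents a wider effective target.
reach_mm = 150.0
tight_window_mm = 1.0    # assumed: small acceptable misalignment
generous_window_mm = 3.0  # assumed: larger acceptable misalignment

print(f"tight:    {index_of_difficulty(reach_mm, tight_window_mm):.2f} bits")
print(f"generous: {index_of_difficulty(reach_mm, generous_window_mm):.2f} bits")
```

A higher index means a slower or less reliable movement for the same operator, which is why widening the acceptable alignment window matters so much under a strict work time standard.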

Lesson Four: Very often there is a Design issue at the root cause of a production problem.  Not always, of course – but in my experience the probability has been very high that Design is a contributor to the issue.  Note that Design is the foundation: Reliability, Functionality, and Quality all start with Design… one can have production issues even with a good design, but one cannot have good production with a bad design.

Based on my write-up we made an insert for the mold of the female port (fortunately the mold boss forming the core of the port was an insert that could be easily changed!) and tried it.  Leak failures from O-ring rollouts fell to nothing.  But there’s one more lesson… I got a call from a member of the Design group asking me why I was proposing changing the design, including altering the Design Guidelines.  When I asked if he’d seen my write-up, he said yes.  When I asked if there was a problem with it, or with the results showing it worked, he snarled – literally snarled – through the phone: “You’re just a Manufacturing Engineer, what can you know?”  Needless to say I was tempted to retort, but instead I returned to the successful results and appealed to his “We’re all one company, right?” spirit.

Lesson Five: People can get very protective of “their turf”; keep that in mind as you propose changes, especially if the changes are in someone else’s department.  (In retrospect I should have involved the Design group from the beginning to have them on the team and involved once I figured out this was a Design issue.  Thus I would have avoided turf battles, toe-stepping, and bruised egos.)

The Problem Toggle

In both instances changes were made that turned the problem off.  In both instances we understood the root cause well enough to know that if we had gone back to the old design, the problem would have returned… and why it would have returned.  In these two cases there was just one true “root cause”, Design, but in other cases I’ve experienced there were multiple factors that worked together to create the problem.
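As a rough illustration of what “turning it on and then back off again” means in terms of recorded trials, the sketch below checks whether a failure tracks a suspect factor through at least two switches.  The function name, the data layout, and the two-switch threshold are my own assumptions for the example, not part of any formal method named in this essay.

```python
from typing import Iterable, Tuple


def is_problem_toggle(observations: Iterable[Tuple[bool, bool]]) -> bool:
    """Return True if the failure follows the suspect factor on and off.

    Each observation is (factor_present, failure_seen), in the order the
    trials were run. The check passes only if the failure appears exactly
    when the factor is present AND the factor was switched at least twice
    (for example off -> on -> off), so a single coincidence does not count.
    """
    obs = list(observations)
    if not obs:
        return False
    # The failure must match the factor in every single trial.
    if any(factor_on != failed for factor_on, failed in obs):
        return False
    # Count switches of the factor between consecutive trials.
    switches = sum(1 for prev, cur in zip(obs, obs[1:]) if prev[0] != cur[0])
    return switches >= 2


# The torch bench test from this essay, encoded as (cap loosened, coolant leaked):
torch_trials = [(False, False), (True, True), (False, False), (True, True)]
print(is_problem_toggle(torch_trials))  # prints "True"
```

A single before-and-after comparison would fail this check, which is the point: one switch shows a correlation, while switching back and forth shows control over the problem.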

Only by systematically working through a formal process, often with tools like an Ishikawa Diagram, and testing each identified possibility by duplicating the failure conditions to see whether the change affects failure rates, can a problem be declared solved.

Otherwise, solutions become a variant of “I’ve got everything just right; don’t touch anything!”  And that’s not a way to design and produce in today’s hypercompetitive world.

 

© 2014, David Hunt, PE

2 thoughts on “Fix the Problem XIII – The Problem Toggle”

  1. Pretty much the same principles apply to the troubleshooting of software issues too. You can almost never truly fix a software bug until you’re able to replicate at will.
