A tale from a customer reaching (and exceeding) the 64 GB limit

As I’ve tweeted, I have spent the last couple of days (and the weekend) helping out a customer that exceeded the hard 64 GB database size limit in Lotus Domino. Before discussing how we solved the problem and got the customer back in business, I would like you to think about how situations like this could be avoided. And avoiding it is key, as once you exceed the size you’re doomed.

First — how and why would a database platform EVER allow a database to cross a file size that makes it break? Why doesn’t Domino start to complain at 50 GB and make the warnings progressively harder to ignore as the database gets closer to 64 GB? Why doesn’t it refuse new data once it reaches 60 GB? I find it totally unacceptable that a software product allows a database to exceed a size it knows it cannot handle.

Now I know that there are considerations for such a warning and that it could be done in application code (e.g. the database script or a QueryOpen event), but it really isn’t something an application developer should have to think about. It would also need to cover backend logic, so it doesn’t really lend itself to a UI computation. I also know that DDM or a similar tool could warn about it, but that still doesn’t change my stance. The 64 GB limit is a hard limit, and being warned before reaching, and exceeding, it shouldn’t depend on me configuring a specific piece of functionality.
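
That said, a basic check is not hard to build for those who do want it in their own applications. Below is a minimal sketch of what it could look like in the Database Script Postopen event; the 50 GB threshold and the message wording are my own example values, not anything built into Domino.

```
' Database Script - Postopen event (sketch only).
' Warns the user when the database approaches the 64 GB NSF limit.
' The 50 GB threshold is an arbitrary example value.
Sub Postopen(Source As Notesuidatabase)
    Dim db As NotesDatabase
    Dim sizeGB As Double
    Set db = Source.Database
    sizeGB = db.Size / 1073741824#    ' bytes to GB
    If sizeGB > 50 Then
        Messagebox "This database is " & Format$(sizeGB, "#0.0") & _
        " GB and is approaching the 64 GB NSF limit. Please archive documents.", _
        48, "Database size warning"
    End If
End Sub
```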

Second — having the option of keeping the view index in another location/file than the database itself would have helped. This has been brought up a number of times, including at Lotusphere Ask-The-Developers sessions. One could argue that externalizing the view index from the database would just have postponed the problem, but the view index takes up a substantial amount of disk space for databases of this size.

Now on to how we saved the data.

The bottom line is that the customer was lucky. VERY lucky. The customer uses Cisco IP telephones and keeps a replica of the database in question on a secondary server for phone number lookups using a Java servlet. Due to the way the servlet is written, only a single, very small, view was built on the secondary server. This in turn meant that the database that had exceeded 64 GB on the primary server was “only” 55 GB on the secondary server. The database on the primary server was toast and gave out very interesting messages when attempting to access it or run fixup on it:

**** DbMarkCorruptAgain(Both SB copies are corrupt)

So thank God they had the secondary server; otherwise the outcome of the story would have been much less pleasant. Using the secondary server we were able to:

  1. Take the database offline (restrict access using ACL)
  2. Purge all view indexes (using Ytria ViewEZ)
  3. Create a design-only copy of the database to hold the archived documents
  4. Delete all views to avoid them accidentally being built
  5. Build a very simple view to prepare for data archiving
  6. Write a LotusScript agent to archive documents (copy then delete) from the database (a sketch follows after this list)
  7. Use Ytria ScanEZ to delete deletion stubs from the database (this works for them because the database isn’t replicated to user workstations or laptops)
  8. Do a compact to reclaim unused space
  9. Make the database available on the primary server
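
For step 6 the agent was a simple copy-then-delete loop. Here is a minimal sketch of the approach; the view name “ArchiveCandidates” and the target file “archive.nsf” are placeholders, not the actual names used at the customer.

```
' Archiving agent sketch: copy documents into the design-only archive copy,
' then delete the originals. View and file names are placeholders.
Sub Initialize
    Dim session As New NotesSession
    Dim db As NotesDatabase
    Dim archiveDb As NotesDatabase
    Dim view As NotesView
    Dim doc As NotesDocument
    Dim nextDoc As NotesDocument

    Set db = session.CurrentDatabase
    Set archiveDb = session.GetDatabase(db.Server, "archive.nsf")
    Set view = db.GetView("ArchiveCandidates")

    Set doc = view.GetFirstDocument
    Do Until doc Is Nothing
        ' Fetch the next document before removing the current one
        Set nextDoc = view.GetNextDocument(doc)
        ' Copy first, and only delete the original if the copy succeeded
        If Not (doc.CopyToDatabase(archiveDb) Is Nothing) Then
            Call doc.Remove(True)
        End If
        Set doc = nextDoc
    Loop
End Sub
```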

Whew! They are now back in business after rebuilding the views in the database. They were lucky – VERY lucky. If they hadn’t had that secondary replica, the data would probably have been lost, to much distress. To them and to me.

So what are the main takeaways from this?

  1. UI check — in the future all databases that I develop will have a database script check on the database size (along the lines of the sketch above) to try and prevent situations like this
  2. DAOS — enable DAOS for databases to keep attachments out of the database and keep the size down (see the console example after this list)
  3. Monitoring — monitor databases either using DDM or other tools to try and prevent situations like this
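
On the DAOS point: once DAOS has been enabled on the server itself (in the Server document), an existing database can be DAOS-enabled with a copy-style compact from the server console. The file path below is just a placeholder.

```
load compact apps\bigdatabase.nsf -c -daos on
```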

And so concludes a story from the field. Four days later, with my hair having turned gray from watching copy/fixup/compact progress indicators, the customer is back in business and happy once again. Whew!!

11 thoughts on “A tale from a customer reaching (and exceeding) the 64 GB limit”

  1. You are really too forgiving (imho). 64 GB for a document-oriented datastore is not much. Instead of improving the warnings, such limitations should not be in the product anymore.

    If I can store 500GB+ in CouchDb, MongoDb(64Bit) or Riak then nsf should handle that too.

    Of course an interesting story with a good end. Thank you for sharing.

  2. Regularly have people inflating archive databases into unhealthy regions. DDM warns above 50 GB and we try to persuade users to delete data, or provide a 2nd archive db.

  3. The limitation of 64 GB is due to OS level file size restrictions.

    Using Windows servers and NTFS as a file system, the max size of a single file is 64 GB. Using iSeries or a Linux distro, the limit is higher.

    As mentioned, DAOS and DDM monitoring are important when dealing with large Notes dbs.

  4. I asked Ed Brill at a conference about taking view indexes out of the NSF file and he said that they are not planning to do anything about it. I think we have to live with it for another decade or so.

  5. @Erik – The obvious answer is to separate NSFs into multiple files. No DB that holds > 64GB of data does it with one large file.

  6. I think this anecdote says less about how Domino should behave with very large files, and more about how it should behave in deployment contexts where administrative attention is very low.

    “I also know that DDM or similar could warn about it but it still doesn’t change my stance.”

    I was going to reply here that, given that DDM would have warned them but they didn’t have DDM set up (because it’s a pain for low-admin-profile shops), the root failure was that DDM has no automated default setup.

    But then I thought I should fact check myself before saying that, so I checked out my test server environment. It turns out that most of this IS set up by default. DDM exists, and turning on the probes is very easy.

    The problem is that there are no default event handlers. You have to define on a case-by-case basis what you want the events that DDM observes to do. So there’s not even an inherent event to email LocalDomainAdmins on a Failure level event.

    I also wanted to find out what the actual event was for DDM if you exceed 64GB in an NSF. As far as I can tell, this is the event…

    Event Type: Operating System

    Event Subtype: Disk

    Event Severity: Warning (Low)

    Suppression Time: 0 minutes

    Original Text: Attempt to extend file past supported size for platform.

    So, yeah… attempting to exceed the operating system limit on an NSF generates the most casual warning available, on par with the event generated when a modem-based server-to-server connection is completed. (Call is finished)

    I’d like to say that DDM and DCT are amazing and powerful tools that shouldn’t require so much investment overhead to exploit, but even the most casual browsing of the event list for DDM reveals that the events themselves are totally out of touch with reality, and it would require MASSIVE work to get them up to a coherent level of utility for a customer that doesn’t have full-time local Notes administrators.

  7. Still doesn’t change the fact that every database should have some sort of archiving worked out from day one. Otherwise they just grow until they are slow and worthless.

  8. @Erik Brooks – I agree that splitting a large db into multiple smaller dbs would be preferable performance-wise, but sometimes the customer doesn’t want it. One customer didn’t want to split a large db (long story behind this) and DAOS was the “saviour” after hitting the 64 GB limit in real life; fortunately no corruption occurred.

    @Henning – I stand corrected. I was told by OS specialists that the 64 GB limit was due to running Windows server with NTFS. Then it seems that IBM/Lotus has hardwired a “lock”/error message themselves when the Domino server is running on a Windows server OS. (One of the options for “fixing” a 64 GB db was to move it to an iSeries server, which didn’t have the 64 GB limit; or so I was told.)

    @Nathan – I’m impressed by your thoroughness when replying, and agree with your assessment.
