Scenario: 
An ESB connects to an application through a file system, either mounted or local. It polls for the file that the application generates. Once the file is available, the ESB reads and processes it so that it can be sent to its final destination.
Challenges:
1. A partially generated file may get picked up by the ESB.
2. Non-transactional behavior of the file system.
Options:
Time based polling: In this approach the ESB waits a few minutes before polling the file system.
Assumption: We know the maximum size of the file that the application will generate, and we are 100% sure that the file is completely generated after x minutes.
Risk: Suppose we polled for a file at time t. The next poll then starts at time t + x, on the assumption that the next file will be completely written to disk by then. However, if the application generating the file fails and has to start over, it may be too late for it to finish writing the file by t + x. The ESB then gets a half-cooked file.
Conclusion: I think this approach is very risky and likely to cause data inconsistency very often in a production environment.
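To make the risk concrete, here is a small timeline sketch in Python. The function name and all the numbers (a 5 minute polling interval, a 3 minute write, a restart at minute 4) are made up for illustration and do not come from any real system.

```python
def is_complete_at_poll(poll_time, write_start, write_duration):
    """Return True if a write that began at write_start and takes
    write_duration has finished by poll_time (all values in minutes)."""
    return write_start + write_duration <= poll_time


# Hypothetical numbers: the ESB polls every x = 5 minutes
# and the file takes 3 minutes to write.
x = 5
t = 0
next_poll = t + x

# Normal run: the application starts writing right after the poll at t.
normal = is_complete_at_poll(next_poll, write_start=0, write_duration=3)

# Failure run: the application crashes and restarts the write at minute 4.
after_restart = is_complete_at_poll(next_poll, write_start=4, write_duration=3)

print(normal)         # True
print(after_restart)  # False
```

In the second case the ESB polls at minute 5 while the application still needs until minute 7, which is exactly the half-cooked file described above.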
Size based polling: In this approach the ESB polls several times and "intelligently" concludes from the file size that the file has been written completely.
Assumption: If the file size has not grown after n polls of the same file, we are 100% sure that the application has written the file completely.
Risk: The application may fail while writing the file. If the application does not handle the exception/compensation properly, the partially generated file will not be deleted and will stay in place. After polling n times, the ESB will assume that the file is completely written because its size has stayed the same. In fact it is not complete and might be missing its trailer or header.
Conclusion: This is a much better option than the first. However, it still has the problem described above. Unless the ESB can notify the application that it has processed x records or that there was a problem with the file (which is difficult in a file based, fire-and-forget style of asynchronous integration), or the application has proper exception handling in place, there is always a great risk of losing data, especially if the file structure and parsing matter from the ESB's point of view.
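A minimal sketch of the size check in Python might look like the following. The function name, the default of n = 3 polls and the 1 second interval are my own assumptions for illustration; a real ESB would normally do this through its file adapter configuration rather than hand-written code.

```python
import os
import time


def wait_until_size_stable(path, n=3, interval=1.0, sleep=time.sleep):
    """Poll the size of path until it is unchanged for n consecutive polls.

    Returns the stable size in bytes. A writer that crashed mid-file also
    produces a stable size, so this cannot prove the file is complete.
    """
    last_size = os.path.getsize(path)
    unchanged = 0
    while unchanged < n:
        sleep(interval)
        size = os.path.getsize(path)
        if size == last_size:
            unchanged += 1
        else:
            # The file grew (or shrank): start counting again.
            unchanged = 0
            last_size = size
    return last_size
```

The sleep parameter is only there so that the wait can be replaced in a test. Note that the function returns as soon as the size stops changing, whether the writer finished or died, which is the risk described above.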
Polling for an "OK" file: In this approach the ESB waits for a 0 KB file with the same name as the data file, with an extension such as ".OK" or ".complete" appended.
Assumption: The application, and ONLY the application, knows when it has FINISHED writing a file to disk.
Risk: If the application does not already generate the OK file, that functionality has to be added. Also, in this case the ESB only archives the OK file, so confusion can arise when a lot of data files are present in the ESB "Inbound" folder. Remember that in the previous options we archived the data files so that the ESB would not pick up the same file twice.
Conclusion: Since the risks in this option are minimal, in my opinion this approach, possibly combined with size based polling (option 2), should be the preferred implementation. Because the file system is non-transactional, only the application knows when it has finished writing the file. It is therefore important that it signals the other application (the ESB) when it has finished, and the OK file approach (option 3) is built entirely on this idea.
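The handshake can be sketched in Python roughly as follows. Both function names and the ".OK" suffix default are assumptions for illustration; the first function stands in for the application and the second for the ESB side.

```python
import os


def write_with_ok_marker(data_path, payload, ok_suffix=".OK"):
    """Application side: write the data file, then create the empty marker."""
    with open(data_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # The marker is created only after the data file is written and closed.
    with open(data_path + ok_suffix, "wb"):
        pass


def find_ready_files(inbound_dir, ok_suffix=".OK"):
    """ESB side: return the data files that have a matching marker file."""
    ready = []
    for name in sorted(os.listdir(inbound_dir)):
        if name.endswith(ok_suffix):
            data_path = os.path.join(inbound_dir, name[: -len(ok_suffix)])
            if os.path.isfile(data_path):
                ready.append(data_path)
    return ready
```

A data file without its marker is simply ignored by the ESB side, so a crash in the middle of a write leaves nothing for the ESB to pick up.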
These options also assume that the application has no means of communicating with the ESB other than the file system.
 
 