Thursday, September 22, 2005

Getting the number of records from a fixed-length ASCII file


Problem/Question/Abstract:

I work a lot with fixed-length ASCII files, and I need to know how many total lines there are in a file. Sure, I can open up the file in a text editor, but really large files take forever to load. Is there a better way?

Answer:

As Mr. Miyagi said to Daniel-san in Karate Kid, "Funny you should ask..." Yes, there is a better way. What I'm going to show you may not be the best way, but it's reasonably fast, and exceptionally easy to use. It starts out with this premise. If you know the total number of bytes in the file and know the length of each record, then if you divide the total bytes by the record length, you should get the number of records in the file. Sounds reasonable, right? And it's exactly the way we do it.
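To make the premise concrete, here is a minimal sketch of the arithmetic on its own. The RecordCount helper, the 80-character record layout and the -1 return value are assumptions made up for illustration; they are not part of the functions shown later.

```pascal
program RecordCountDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Hypothetical helper: number of records in a file of FileBytes bytes
// when every record, terminator included, occupies RecordBytes bytes.
// Returns -1 when the size is not an exact multiple of the record
// length, which suggests the file is not purely fixed-length.
function RecordCount(FileBytes: Int64; RecordBytes: Integer): Int64;
begin
  if (RecordBytes <= 0) or (FileBytes mod RecordBytes <> 0) then
    Result := -1
  else
    Result := FileBytes div RecordBytes;
end;

begin
  // Assumed layout: 80 data characters + #13#10 = 82 bytes per record
  WriteLn(RecordCount(82000, 82)); // prints 1000
  WriteLn(RecordCount(82001, 82)); // prints -1 (one stray byte)
end.
```

The exact-multiple check is a design choice of this sketch: a leftover byte or two usually means a missing final CR/LF or a record of a different length, and it is cheaper to find that out here than after processing the file.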

For this example, I used a TFileStream object to open up my text file. I like using this particular object because it has some convenient methods and properties that I can use to get the information that I need; in particular, the Size property and the Read and Seek methods. How do I use them? Let's go through some plain English to give you an idea:

Open up a file stream on a text file
Get its total byte size
Now, serially move through the file, byte-by-byte reading each byte into a single-character buffer until you reach a return character (#13).
As you pass each byte, increment a counter variable that will serve as both a file reference point and later, the length of the record.
When you get to the return character, break out of the loop and adjust the reference counter so that it covers both bytes of the #13#10 CR/LF pair.
Finally return the result as the file size divided by the record length.

Here's the code that accomplishes the English above:

{======================================================================
This function will give you the exact record count of a file. It uses
a TFileStream and goes through it byte by byte until it encounters
a #13. By then recLen already counts the #13, so it adds 1 for the #10
that completes the CR/LF pair, then divides the byte size of the file
by that true record length.

Note that this will only work on text files whose records all have the
same length and end in CR/LF.
======================================================================}

function GetTextFileRecords(FileName: string): Integer;
var
  ts: TFileStream;
  fSize: Int64;
  recLen: Integer;
  buf: AnsiChar; //exactly one byte, whatever size Char is
begin
  Result := 0; //returned when the file contains no #13
  buf := #0;
  recLen := 0;
  //Open up a File Stream
  ts := TFileStream.Create(FileName, fmOpenRead);
  try
    //Get the File Size
    fSize := ts.Size;
    //Move through the file a byte at a time
    while (buf <> #13) do
    begin
      ts.Seek(recLen, soFromBeginning);
      if ts.Read(buf, 1) <> 1 then
        Break; //end of file reached without finding a #13
      Inc(recLen);
    end;
  finally
    ts.Free;
  end;
  //recLen already counts the #13, so add 1 for the #10 of the CR/LF pair.
  if buf = #13 then
    Result := Round(fSize / (recLen + 1));
end;

As I mentioned above, this may not be the "best" way to do this, but it is a way to approach this problem. A faster way to do this would have been to open up the file as a plain untyped file, then read a bunch of bytes into a large buffer, let's say an Array of Char 4K in size. Scanning an array in memory is much faster than reading a file one byte at a time, but the disadvantage there is that you run the risk of having the buffer too small to contain the first return character. I've seen some fixed-length ASCII files with line sizes up to 8K.

In any case, the method I presented above may not be the most efficient, but it's safe, and it works. Besides, what's a few milliseconds worth to you? Have at it!
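The too-small-buffer risk mentioned above can be sidestepped by reading block after block until a #13 turns up. What follows is only a sketch under that assumption; the FindRecordLength name and the 4K block size are hypothetical choices, and it is a variation on the buffer idea rather than one of the article's own functions.

```pascal
program FindRecLen;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils, Classes;

// Hypothetical helper: returns the record length including the CR/LF
// pair, or 0 if the file contains no #13 at all.
function FindRecordLength(const FileName: string): Integer;
const
  BlockSize = 4096;
var
  fs: TFileStream;
  buf: array[0..BlockSize - 1] of Byte;
  bytesRead, i, scanned: Integer;
begin
  Result := 0;
  scanned := 0; // bytes examined in earlier blocks
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      bytesRead := fs.Read(buf, BlockSize);
      for i := 0 to bytesRead - 1 do
        if buf[i] = 13 then
        begin
          // scanned + i data bytes precede the #13; add 2 for #13#10
          Result := scanned + i + 2;
          Exit; // the finally block still frees the stream
        end;
      Inc(scanned, bytesRead);
    until bytesRead = 0;
  finally
    fs.Free;
  end;
end;

begin
  if ParamCount = 1 then
    WriteLn(FindRecordLength(ParamStr(1)));
end.
```

Because the running offset is carried from one block to the next, the record may be longer than the buffer; the cost is one extra Read call per 4K of the first record.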

Wait a minute! 10:00PM

Okay, I couldn't resist. I realized that I could've done better than my example above. Here's the method I described immediately above:

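function GetTextFileRecords(FileName: string): Integer;
const
  BlockSize = 8192;
var
  F: file;
  fSize,
    amtXfer,
    crPos: Integer;
  buf: array[0..BlockSize] of AnsiChar; //one extra element for the closing #0
begin
  Result := 0; //returned when the block contains no #13
  AssignFile(F, FileName); //Open up the text file as an untyped file
  Reset(F, 1);
  try
    fSize := FileSize(F); //Get the file size
    BlockRead(F, buf, BlockSize, amtXfer); //read in up to an 8K block
  finally
    CloseFile(F); //close the file, you're done
  end;
  buf[amtXfer] := #0; //terminate the data so StrPas stops at the bytes read
  crPos := Pos(#13, StrPas(buf));
  if crPos > 0 then
    Result := Round(fSize / (crPos + 1));
end;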

There are several things different about this function as opposed to the function above. First of all, it involves less code. This is due to not having to construct an object; I open up an untyped file, get its size, read a big block, then immediately close it. Notice too that I don't use a loop to find a #13. Instead, I use the StrPas function to convert the array of char into a string that's passed to the Pos function, which gives me the 1-based position of the return character. That position equals the number of data bytes plus the #13 itself, so adding one to it accounts for the #10 portion of the CR/LF pair and yields the record length.

Because it reads one big block instead of making a separate Read call for every byte, and doesn't have to construct an object, this method is a lot faster than the method above, and amazingly it's not very complicated. Where this type of operation can get tricky is with the BlockRead function. In order to use BlockRead successfully, you need to specify a record size in the call to Reset. That can be a bit confusing, so just remember this: for byte-by-byte serial reads through a file, always use a record size of 1. Also, notice that I included a variable called amtXfer. BlockRead fills this with the actual number of bytes read. If you don't supply it, BlockRead raises an I/O exception whenever it reads fewer bytes than you asked for, which is exactly what happens with a file smaller than the block. That's not too much of a problem because all you need to do is create an exception handling block - but why bother? Just supply the variable, and you don't have to worry about the exception.
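To see the record size and the transfer count at work, here is a small sketch that walks a whole file in 8K blocks and adds up what BlockRead reports. The CountBytesRead name is hypothetical, and setting FileMode to read-only before Reset is an extra assumption of this sketch (it lets read-only files open), not something the function above does.

```pascal
program BlockReadDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
uses
  SysUtils;

// Hypothetical helper: total number of bytes BlockRead delivers when
// the file is read from start to finish in 8K blocks.
function CountBytesRead(const FileName: string): Int64;
const
  BlockSize = 8192;
var
  F: file;
  buf: array[0..BlockSize - 1] of Byte;
  amtXfer: Integer;
  oldMode: Byte;
begin
  Result := 0;
  AssignFile(F, FileName);
  oldMode := FileMode;
  FileMode := fmOpenRead; // open read-only; FileMode is a global
  try
    Reset(F, 1);          // record size 1: all counts are in bytes
  finally
    FileMode := oldMode;  // put the global back the way it was
  end;
  try
    repeat
      // amtXfer receives the bytes actually read, so a short final
      // block (or an empty file) does not raise an I/O error
      BlockRead(F, buf, BlockSize, amtXfer);
      Inc(Result, amtXfer);
    until amtXfer < BlockSize;
  finally
    CloseFile(F);
  end;
end;

begin
  if ParamCount = 1 then
    WriteLn(CountBytesRead(ParamStr(1)));
end.
```

With a record size of 1 the total should match what FileSize reports for the same file; with any other record size both numbers would be in records rather than bytes.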

Okay, now it's time to close this out... Is this the best way to get the record length of a fixed length text file? Admittedly, it's one of the faster ways save using Assembler. But I'm wondering what a purely WinAPI call set would look like.... If you have any ideas, please make sure to let me know!

Here I Go Again! 11:05 PM

I guess my curiosity got the best of me tonight, because I just wasn't satisfied doing just the BlockRead method. I knew there had to be another way to do it with WinAPI calls. So I did just that. Look at the code below:

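function GetTextFileRecordsWinAPI(FileName: string): Integer;
const
  BlockSize = 8192;
var
  F: THandle;
  amtXfer,
    fSize: DWORD;
  crPos: Integer;
  buf: array[0..BlockSize] of AnsiChar; //one extra element for the closing #0
begin
  Result := 0; //returned when the file can't be opened or has no #13
  //Open up file. FILE_FLAG_NO_BUFFERING is left out because it requires
  //a sector-aligned buffer, which a local array does not guarantee.
  F := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if F = INVALID_HANDLE_VALUE then
    Exit;
  amtXfer := 0;
  fSize := GetFileSize(F, nil); //Get the file's size
  ReadFile(F, buf, BlockSize, amtXfer, nil); //Read a block from the file
  CloseHandle(F);
  buf[amtXfer] := #0; //terminate the data so StrPas stops at the bytes read
  crPos := Pos(#13, StrPas(buf));
  if crPos > 0 then
    Result := Round(fSize / (crPos + 1));
end;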

This method is almost exactly the same as the one immediately above, but instead uses WinAPI calls to accomplish the same task.

Now which method is better? I DON'T KNOW! Actually, for simplicity's sake, I prefer the elegance of the second method - there's just less to get wrong. With the WinAPI method, while the amount of code is about the same, the CreateFile function is not the easiest thing to work with - I spent a bit of time Alt-Tabbing between the code editor and Windows help to get the syntax and constants right. Granted, it's easier now that I've done it, but it's not a method that I prefer.

So I'll leave it up to you to decide which method you like better.
