Building A Line-by-Line File Reader Using HTML5 and JavaScript

TL;DR: In this post I go over how to read files one line at a time using the HTML5 File API. You can view the final code on GitHub.

Last week, I came across a question on Stack Overflow: how can one read files using the HTML5 File API, one line at a time, without loading the entire file into memory?

Why, one might ask, would this be useful? Before any file data can be processed it needs to be loaded into memory. But chances are that if you attempt to load a 300 MB text file into memory all at once, your browser will grind to a halt. Thankfully, this can be avoided by reading the file in small chunks and breaking each chunk into lines. In this post I will go over how to build a tool to automate this process.

The Outline

In theory, we need to:

  • Read a chunk of the file
    • If the chunk contains a newline character:
      • Split chunk into an array of lines
      • If there is still more data to read:
        • Save the last item in the array, as it may be an incomplete line
      • Emit each line as an event
        • If there are no lines left to emit:
          • Read and parse another chunk
  • If there is no data left to read
    • Emit any stored lines
    • Emit an end event
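
The central step of that outline, splitting a chunk while holding back a possibly incomplete last line, can be sketched as a small pure function. Note that splitChunk and its parameter names are hypothetical and used only for illustration; the LineReader built below keeps this state on an internal object instead.

```javascript
// Hypothetical helper: combine leftover text with a new chunk and
// separate the complete lines from a possibly incomplete remainder.
function splitChunk(leftover, chunk, hasMoreData) {
  var text = leftover + chunk;

  // No newline yet: nothing is known to be complete unless the
  // file has ended
  if (text.indexOf('\n') === -1) {
    if (hasMoreData) {
      return { lines: [], remainder: text };
    }
    return { lines: text.length ? [text] : [], remainder: '' };
  }

  var lines = text.split('\n');
  // While more data is coming, the last piece may be a partial line
  var remainder = hasMoreData ? lines.pop() : '';
  return { lines: lines, remainder: remainder };
}

var first = splitChunk('', 'alpha\nbe', true);
// first.lines is ['alpha'], first.remainder is 'be'
var second = splitChunk(first.remainder, 'ta\ngamma', false);
// second.lines is ['beta', 'gamma'], second.remainder is ''
```

The remainder from one call is fed back in as the leftover for the next, which is how a line that straddles two chunks gets stitched back together.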

The Pieces

Before we write any code, let’s take a look at the File API and see what’s in our toolbox.

Getting File References

References to files can be obtained through input[type="file"] fields. For example, if we have a file input field:

<input id="my-file-input" type="file">

After the user has selected a file we can access a reference to it like so:

var myFileInput = document.getElementById('my-file-input');
var myFile = myFileInput.files[0];

The files property is a FileList object, which is an array-like object of file references. It’s important to note that these are just references to files on the user’s machine and contain no actual file data.

We can, however, access information about the file such as its size.

var myFileSize = myFile.size;

Reading Files

Before we can read any files we need to create a FileReader instance.

var fr = new FileReader();

When a read operation completes successfully, the load event fires, our onload handler is called, and the contents of the file are accessible through the result property of our FileReader instance.

fr.onload = function () {
  // 'this' references 'fr', our 'FileReader' instance
  var fileContents = this.result;

  // Do something with 'fileContents'...
};

Now we can actually read our file using the readAsText() method.

fr.readAsText( myFile );

The slice() method

The most useful method for our purposes is slice(). It is not a FileReader method; File objects inherit it from Blob. It takes a start byte and an end byte (exclusive) and returns a new Blob that references that portion of the file.

A Blob object represents a file-like object of immutable, raw data.
MDN

If we want to read the first 100 bytes of a file, we can slice off a new Blob and pass it into the readAsText() method, like so:

var first100 = myFile.slice(0, 100);

fr.readAsText( first100 );
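
To see how repeated slice() calls can cover a whole file, here is a minimal sketch that only computes the byte ranges. The helper name chunkRanges is hypothetical and not part of the File API.

```javascript
// Hypothetical helper: list the [start, end) byte ranges needed to
// cover a file of `size` bytes in pieces of at most `chunkSize` bytes.
function chunkRanges(size, chunkSize) {
  if (!(chunkSize > 0)) {
    throw new RangeError('chunkSize must be a positive number');
  }

  var ranges = [];
  for (var start = 0; start < size; start += chunkSize) {
    // slice() clamps an end past the file size on its own, but
    // clamping here keeps the listed ranges exact
    ranges.push([start, Math.min(start + chunkSize, size)]);
  }
  return ranges;
}

var ranges = chunkRanges(250, 100);
// ranges is [[0, 100], [100, 200], [200, 250]]
```

Each pair could then be passed to myFile.slice(start, end), with the resulting Blob handed to readAsText().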

Almost ready

We now have the tools we need to build our LineReader. But first, let’s create a short example of how we’ll use it.

<input type="file" id="file">
<button id="read">Read</button>

<pre id="output"></pre>

<script>

    $(function () {
      var lr = new LineReader({
        chunkSize: 1
      });

      $('#read').click(function () {
        var file = $('#file').get(0).files[0];
        var totalCount = 1;
        var $output = $('#output');

        lr.on('line', function (line, next) {
          $output.text(
            $output.text() + '\n' + 
            totalCount + ': ' + line
          );

          totalCount++;

          /**
           * Simulate some sort of asynchronous operation
           */
          setTimeout(function () {
            next();
          }, 100);
        });

        lr.on('error', function (err) {
          console.log(err);
        });

        lr.on('end', function () {
          console.log('Read complete!');
        });

        lr.read(file);
      });

    });

</script>

Putting everything together

Now that we have a plan and a goal, let’s start building!

The Constructor

First, let’s create some basic constructor boilerplate for our LineReader.

var LineReader = function (options) {
  /**
   * If 'this' isn't an instance of 'LineReader' then the 
   * user forgot to use the 'new' keyword when instantiating
   * 'LineReader'. Let's do it for them; otherwise 'this'
   * will reference the 'window' object
   */
  if ( !(this instanceof LineReader) ) {
    return new LineReader(options);
  }

  /**
   * We'll use '_internals' to store data we don't want 
   * public facing
   *
   * We'll also need a reference to 'this' as it will be 
   * overridden in the 'onload' and 'onerror' events
   */
  var internals = this._internals = {};
  var self = this;
};

Let’s add some internal properties.

var LineReader = function (options) {
  if ( !(this instanceof LineReader) ) {
    return new LineReader(options);
  }

  var internals = this._internals = {};
  var self = this;

  /**
   * Let's create a 'FileReader' instance. We'll only 
   * need one per 'LineReader' instance
   */
  internals.reader = new FileReader();

  /**
   * If 'chunkSize' has been set by the user, use 
   * that value, otherwise, default to 1024
   */
  internals.chunkSize = ( options && options.chunkSize )
    ? options.chunkSize
    : 1024;

  /**
   * Let's create an object to house user defined 
   * event callbacks
   */
  internals.events = {};

  /**
   * 'canRead' will be set to false if the 
   * LineReader#abort method is fired
   */
  internals.canRead = true;

  internals.reader.onload = function () {
    // Process text chunk here...
  };

  internals.reader.onerror = function () {
    // Do something with errors here...
  };
};

Our only available option at the moment is how many bytes of the file to read at a time. We’ll store this in the chunkSize property and default to 1024 bytes.
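
The truthiness check in the constructor falls back to 1024 when chunkSize is missing or 0, but it lets negative or non-numeric values through. If stricter handling is wanted, one possible approach is sketched below. The helper resolveChunkSize is hypothetical and not part of the code in this post.

```javascript
// Hypothetical helper: return a usable chunk size. Anything that is
// not a finite number of at least 1 falls back to 1024, and
// fractional values are rounded down.
function resolveChunkSize(options) {
  var DEFAULT_CHUNK_SIZE = 1024;
  var size = options && options.chunkSize;

  if (typeof size === 'number' && isFinite(size) && size >= 1) {
    return Math.floor(size);
  }
  return DEFAULT_CHUNK_SIZE;
}

resolveChunkSize();                     // 1024
resolveChunkSize({ chunkSize: 4096 });  // 4096
resolveChunkSize({ chunkSize: -5 });    // 1024
resolveChunkSize({ chunkSize: '64' });  // 1024 (strings are rejected)
```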

The read() Method

Next, let’s set up our read() method. This method will take on two roles. First, if a file reference is passed in, it will set up all of the file-specific properties we need. Second, it will create a Blob of at most chunkSize bytes, starting at readPos, then advance readPos and read the Blob we just created.

LineReader.prototype.read = function (file) {
  var internals = this._internals;

  /**
   * If 'file' is defined then we want to get its size 
   * and reset 'readPos', 'chunk', and 'lines'
   */
  if (typeof file !== 'undefined') {
    internals.file = file;
    internals.fileLength = file.size;
    internals.readPos = 0;
    internals.chunk = '';
    internals.lines = [];
  }

  /**
   * Extract a section of the file for reading starting 
   * at 'readPos' and ending at 'readPos + chunkSize'
   */
  var blob = internals.file.slice( 
    internals.readPos, 
    internals.readPos + internals.chunkSize 
  );

  /**
   * Update our current read position
   */
  internals.readPos += internals.chunkSize;

  /**
   * Read the blob as text
   */
  internals.reader.readAsText(blob);
};

Utility Methods

Next, we’ll need a way to determine if there is any data left to read. Let’s create a method called _hasMoreData() which will return true if the current position we are reading from is less than the length of the file. Once readPos reaches fileLength every byte has already been requested, so a strict comparison avoids one extra, empty read when the file size is an exact multiple of chunkSize.

Note: I’ve prefixed the method name with an underscore to indicate that it is an internal method. This won’t prevent users from messing with it if they really want to, but it’s good enough for our purposes.

LineReader.prototype._hasMoreData = function () {
  var internals = this._internals;
  return internals.readPos < internals.fileLength;
};

Next, let’s create a method that will allow us to stop the LineReader.

LineReader.prototype.abort = function () {
  this._internals.canRead = false;
};

Events

Great, almost there! Now we need a way to bind and emit events. To bind events we’ll define a method called on which will put a user defined function into the events object that we created earlier.

LineReader.prototype.on = function (eventName, cb) {
  this._internals.events[ eventName ] = cb;
};

This allows us to bind events like so:

myLineReader.on('line', function (line, next) {
  // Do stuff...
});

To emit events we’ll create a method for internal use called _emit which will take in an event name and an array of arguments to pass to the event handler. If the requested event has been bound, we’ll use apply to call the handler with our LineReader instance as this and our array of arguments as its parameters.

LineReader.prototype._emit = function (event, args) {
  var boundEvents = this._internals.events;

  if ( typeof boundEvents[event] === 'function' ) {
    boundEvents[event].apply(this, args);
  }
};
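
Because on() assigns to a single slot per event name, binding a second handler to the same event replaces the first. That is fine for this post, but if multiple listeners were ever needed, one possible variation stores an array per event. TinyEmitter below is a hypothetical stand-alone sketch and not part of LineReader.

```javascript
// Hypothetical stand-alone emitter, shown only to illustrate the idea
function TinyEmitter() {
  this._events = {};
}

TinyEmitter.prototype.on = function (eventName, cb) {
  // Create the listener list on first use, then append to it
  (this._events[eventName] = this._events[eventName] || []).push(cb);
  return this; // allows chaining
};

TinyEmitter.prototype._emit = function (eventName, args) {
  var listeners = this._events[eventName] || [];
  // Iterate over a copy so a listener that binds another listener
  // during this call doesn't change the current pass
  listeners.slice().forEach(function (cb) {
    cb.apply(this, args || []);
  }, this);
};

var emitter = new TinyEmitter();
var seen = [];
emitter
  .on('line', function (line) { seen.push('a:' + line); })
  .on('line', function (line) { seen.push('b:' + line); });
emitter._emit('line', ['hello']);
// seen is ['a:hello', 'b:hello']
```

One caveat with this variation: for an event like line, where the handler is given a next() callback, only one listener should be responsible for calling it.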

Sending Lines to the User

Next, we need a method that will step through our array of lines and emit a line event each time. If there are no lines left to emit we’ll read the next file chunk, and if there is no data left or the abort method has been called, we’ll emit the end event. Let’s call this method _step().

LineReader.prototype._step = function () {
  var internals = this._internals;

  /**
   * If there are no lines left to emit and there is still 
   * data left to read, start the read process again, 
   * otherwise, emit the 'end' event
   */
  if (internals.lines.length === 0) {
    if ( this._hasMoreData() ) {
      return this.read();
    }
    return this._emit('end');
  }

  /**
   * If the reading process hasn't been aborted, emit the
   * first element of the line array and pass in '_step'
   * for the user to call when they are ready for the 
   * next line. We have to bind '_step' to 'this', otherwise 
   * it will be in the wrong scope when the user calls it
   */
  if (internals.canRead) {
    this._emit('line', [
      internals.lines.shift(),
      this._step.bind(this)
    ]);
  } else {
    /**
     * If we can't read, emit the 'end' event
     */
    this._emit('end');
  }
};
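
Stripped of the file reading, the stepping pattern is just a queue that hands out one item at a time and waits for next() before continuing. The sketch below shows it in isolation; stepThrough is a hypothetical name used only for this illustration.

```javascript
// Hypothetical stand-alone version of the stepping pattern: hand out
// one queued item at a time and wait for 'next()' before continuing.
function stepThrough(queue, onItem, onEnd) {
  function step() {
    if (queue.length === 0) {
      return onEnd();
    }
    onItem(queue.shift(), step);
  }
  step();
}

var received = [];
var finished = false;

stepThrough(
  ['one', 'two', 'three'],
  function (item, next) {
    received.push(item);
    next(); // a real handler might call this after async work
  },
  function () {
    finished = true;
  }
);
// received is ['one', 'two', 'three'] and finished is true
```

Because the consumer decides when to call next(), a slow handler naturally slows down the reading. When next() is called synchronously, as here, each step nests inside the previous call, so very long queues are better served by an asynchronous handler.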

Handling FileReader Events

Finally, all that’s left is to define our onload and onerror events! For onload we’ll start by appending the result of the read operation to our chunk string. If chunk contains a newline character we’ll split it up into an array of lines. If there is still more data to read we’ll save the last line in the array, as it may be an incomplete line. Then we’ll start the _step() method.

If chunk doesn’t contain a newline character and there is still more data to read, we’ll start the read process again. Otherwise, we’ll emit whatever is stored in chunk as a line and pass along a function that will emit the end event when the user calls next(). If there is no data stored in chunk we’ll just emit the end event.

internals.reader.onload = function () {

  /**
   * Store the processed text by appending it to any 
   * existing processed text
   *
   * 'this' refers to our 'FileReader' instance
   */
  internals.chunk += this.result;

  /**
   * If the processed text contains a newline character
   */
  if ( /\n/.test( internals.chunk ) ) {
    /**
     * Split the text into an array of lines
     */
    internals.lines = internals.chunk.split('\n');

    /**
     * If there is still more data to read, save the last 
     * line, as it may be incomplete
     */
    if ( self._hasMoreData() ) {
      internals.chunk = internals.lines.pop();
    }

    /**
     * Start stepping through each line
     */
    self._step();

  /**
   * If the text did not contain a newline character
   */
  } else {

    /**
     * Start another round of the read process if there is
     * still data to read
     */
    if ( self._hasMoreData() ) {
      return self.read();
    }

    /**
     * If there is no data left to read, but there is still
     * data stored in 'chunk', emit it as a line
     */
    if ( internals.chunk.length ) {
      return self._emit('line', [
        internals.chunk,
        self._emit.bind(self, 'end')
      ]);
    }

    /**
     * If there is no data stored in 'chunk', 
     * emit the end event
     */
    self._emit('end');
  }
};

Our onerror event handler will be a bit simpler:

internals.reader.onerror = function () {
  /**
   * Emit the error event, passing along the error 
   * object to the event handler
   *
   * 'this' refers to our 'FileReader' instance
   */
  self._emit('error', [ this.error ]);
};
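
To sanity check the chunking logic without a browser, the same idea can be simulated over a plain string. This is a simplified stand-in under stated assumptions, not the LineReader itself: linesFromString is a hypothetical helper, it collects lines into an array instead of emitting events, and string indices stand in for byte offsets (the two only match for single-byte text).

```javascript
// Hypothetical in-memory stand-in for the read loop: walk a string in
// fixed-size chunks and collect lines, holding back a partial last
// line until more text arrives.
function linesFromString(text, chunkSize) {
  var lines = [];
  var chunk = '';
  var readPos = 0;

  while (readPos < text.length) {
    chunk += text.slice(readPos, readPos + chunkSize);
    readPos += chunkSize;

    if (chunk.indexOf('\n') !== -1) {
      var parts = chunk.split('\n');
      // Hold back the last piece while more text remains
      chunk = readPos < text.length ? parts.pop() : '';
      lines = lines.concat(parts);
    }
  }

  // Whatever is left at the end is the final, newline-less line
  if (chunk.length) {
    lines.push(chunk);
  }
  return lines;
}

var sample = 'first line\nsecond\nthird';
linesFromString(sample, 4);    // ['first line', 'second', 'third']
linesFromString(sample, 1024); // ['first line', 'second', 'third']
```

The result is the same for both chunk sizes, which is the property we want: where the chunk boundaries fall should never change the lines that come out.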

Conclusion

That’s it! We now have a fully functioning line-by-line file reader. Simple, right? If your eyes are currently glazed over, I’d recommend checking out the fully annotated code. Also, if you see any errors or have any feedback, feel free to comment, shoot me an email, or fork the git repo!
