Categories:

# Parsing Current Cost XML Data and Storing it in a MongoDB Collection

Well, I’ve moved house again, and once again I’ve set up my electricity meter, and once again I can’t find the code I was using to log the data the previous time. So, I’ve ended up re-writing code to log my current cost data, which I thought I’d blog about if only for the fact that next time I’ll know where I should look to find it.

This blog post will take you through some scripts that I’ve written that gather electrical consumption data from my Current Cost CC-128 device and store that data to a MongoDB database. In an upcoming post, I’ll discuss the code (which is available on GitHub) which will make that data available as JSON for me to use in other applications (such as a web-page that plots my historical consumption).

I’ll get this out the way: this is hardly pioneering work. There have been a number of bloggers that have written about the current cost. Therefore, I’ll begin by giving a very brief note on how this blog post differs. After that, I’ll give a brief introduction to the current cost, then I’ll discuss the XML data that is streamed from the Current Cost device. I’ll then briefly go over the database that I’m using to store the data that I retrieve, then I’ll talk about how I actually retrieve and parse the data. After that, I’ll talk about a script that I’ve written that allows you to build up a query which returns the data in JSON format, and finally I’ll turn my attention to some javascript and HTML that renders the data in the form of some line plots.

## Previous works

While I’m sure there are a number of bloggers that have written about the current cost, there are two in particular that I found particularly helpful in developing the code in this post. Firstly, Paul Mutton’s website was of great help. This site uses a fairly simple PERL script that uses regular expressions to match the XML strings being streamed from the Current Cost. This is a good method in that it makes for a really tiny PERL script. However, it makes the script fragile to changes in the XML strings that are received from the device, as a slight change such as a change in element order, data format or number of elements could cause the regular expression to no longer match the XML string, resulting in data being lost.

Jamie Bennett adapted Mutton’s script for use with the Current Cost’s USB lead (whereas Mutton’s approach used a serial lead). However, I found that the baud rate used on Bennett’s wasn’t correct for my device. I’m not sure why this is, it could be that current cost have updated their hardware since Bennett’s post in 2008.

Both of these approaches store their data in an RRDB. More precisely they use RRDTool. This is beneficial in that the database filesize remains constant as old data-points are automatically overwritten as new data-points are entered into the database (obviously the number of records you keep is configurable). In the past, I’ve used this approach too, but I’ve found it to be rather painful when I wanted to play about with the data. Therefore, in these scripts I’ve decided to use a MongoDB database to store each record received from the current cost. Of course, this means that over time my database will consume more hard disk space but on the other hand, it makes it much easier for me to retrieve and play with the high resolution data.

## CurrentCost

The current cost that I’m using is the CC128. I bought it a while ago, and you can get them fairly cheaply from ebay. You can optionally purchase a cable that plugs into a USB socket on a PC, allowing the current cost to stream data from the current cost to the PC. The data is streamed in an XML format, which is described next.

## CurrentCost XML Data

There are a couple of xml messages that the current cost sends out. The one in which we’re most interested here is the data that says the current consumption and temperature. This is sent out more-or-less every 5 seconds, and looks like the following:

<msg>
<src>CC128-v0.11</src>
<dsb>01056</dsb>
<time>14:52:26</time>
<tmpr>23.4</tmpr>
<sensor>0</sensor>
<id>02628</id>
<type>1</type>
<ch1>
<watts>00252</watts>
</ch1>
<ch2>
<watts>00000</watts>
</ch2>
</msg>


We can see here a number of things. The “time” element holds the time at which the current cost believed the record was sampled, the “tmpr” element holds the sampled temperature in degrees centigrade, the “sensor” element holds the ID of the sensor from which the values in this message were sampled (for me this is always sensor 0, indicating the main current sensor on my meter. If you have sensors plugged into other devices, then this number will increase to indicate which device sensor the consumption is sampled from).

It can also be seen that I have two <ch[0-9]> elements, which have “watts” elements as children. My transmitter can hold up to three current “clamps”, which detect the current passing through my mains cables. Currently, I’m using only two of those clamps – one for my normal mains, and another for an eco7 circuit, which turns on over night. Ch1 is my ‘normal’ mains circuit, ch2 is my eco7 circuit.

It can therefore be seen that I’m currently consuming 252 watts of electricity. Ideally, I’d like to capture this data, parse it and then store it into a database. In the next section, I will discuss the MongoDB database that I’m using to do this.

## The MongoDB Database

I started off this project using a MySQL database. However, in the end I decided to move to a nosql solution. There were two primary reasons for this: 1) I hadn’t played with NoSQL solutions before and I wanted to experiment, 2) I wanted a solution that would be able to easily grow to accept different sensors. With regards to the second point, I believe a NoSQL solution is appropriate here because they are (in some cases) schema-less. That is, whereas with traditional relational databases (take for example, my original database) you would specify that a sensor reading has an associated sensor id, time and value and all sensor values to be inserted into the database must contain all of these values and nothing more, with a schema-less database a reading from one sensor may contain a sensor id, time and value, whereas another might contain a sensor id, time, value and value2 (for example). Thus the database can quite easily accommodate any data the sensors throw at it.

When I first started thinking about it, I thought that this wasn’t a good idea. I thought, if there’s no schema, than how can you query the database? How do you know what to look for? Well, I still think this is a problem, and in order for data to be queryable via the REST interface I describe later in this post, the data must be in a specific format (which of course undermines my argument above). However, in principle, my argument re the extensibility of the database still stands. The database will accept any data, but the application must understand how to query that data.

### Experiences with MongoDB

So far I’ve got to say, I’m very happy with MongoDB. My “sensordata” database is standing at 7.95GB, and still going strong.

> show databases;
sensordata      7.9501953125GB


I have a number of collections in my sensordata database. The main collection into which my sensor scripts insert their data is sensordata2. The other sub-collections are produced by a map-reduce operation that takes place every 5 minutes.

> use sensordata;
switched to db sensordata
> show collections
sensordata2
sensordata2.daily
sensordata2.hourly
sensordata2.minutely
sensordata2.monthly


I’ll first discuss the basic document structure that I use for sensor data records, then I’ll talk briefly about indexing, followed by the map reduce operations that I perform in order to calculate aggregate values.

### Sensor Record Document Structure

Selecting a record from MongoDB is really simple. For this section, I’m only interested in any arbitrary record within my database (as they all follow the same basic structure). Therefore, I use the “findOne” method, as follows:

> db.sensordata2.findOne()


which returns:

{
"_id" : {
"time" : ISODate("2013-08-05T18:25:17Z"),
"sensor" : NumberLong(0),
"type" : "temperature"
},
"value" : 24
}


We see here that the document’s “_id” field is a complex object consisting of a time, a sensor id and a type (which for the moment is just a string). The “type” field allows an application to query for a specific set of data. Currently in my flat I’m able to get temperature or humidity values. In addition, the document contains a “value” field, which contains a numeric value.

Just as an example of a query (this isn’t an introduction to MongoDB, but I’ll show this just so people can see what my humidity sensor records look like), here is an example of how to find an arbitrary humidity record:

> db.sensordata2.findOne({"_id.type": "humidity"})

{
"_id" : {
"time" : ISODate("2013-08-05T18:39:36Z"),
"sensor" : NumberLong(100),
"type" : "humidity"
},
"value" : 59.05
}


Finding just an arbitrary record is all well and good. However, most of the time, we’re interested in finding records that match a certain set of criteria. With small databases, this is fairly simple (and indeed with larger databases too, the commands remain the same). However, with larger databases, unless sufficient thought is given to indexing, querying the data can become incredibly slow. I discuss the indexing I’m using in this database in the next section.

### Indexing

The amount of indexing I’ve done on this database is still fairly minimal. By default MongoDB indexes the _id field, but in the case of complex _id fields, the child fields are not automatically indexed. With my database, as is probably the case with almost any database that contains temporal data, most queries will focus on restricting the result set based upon the records’ times. Consequently, through running the operation:

> db.sensordata2.ensureIndex( { "_id.time": 1 } )


I can build an ascending-order index on the time field of the ID field.

However, given that the most interesting information for me will probably be the latest sensor readings, most of my queries will be of the form:

> db.sensordata2.find(someQuery).orderBy("_id.time": -1).limit(x)


i.e. get me the latest x data values. If we have no descending order index on “_id.time”, we’d find that the sort operation takes a very long time. Therefore, I decided to build a reverse-order index over the “_id.time” field, by executing the following:

> db.sensordata2.ensureIndex( { "_id.time": -1 } )


That’s actually all I’ve done regarding indexes on my sensordata2 collection. I’m considering adding another index on the “_id.type” field, but I’ve yet to do it. To view the indexes that are currently on the sensordata2 collection, you can run the command:

> db.sensordata2.getIndexes()


which returns:

[
{
"v" : 1,
"key" : {
"_id" : 1
},
"ns" : "sensordata.sensordata2",
"name" : "_id_"
},
{
"v" : 1,
"key" : {
"_id.time" : -1
},
"ns" : "sensordata.sensordata2",
"name" : "_id.time_-1"
}
]


Here we see the automatic index on the _id field, that MongoDB generates, and the reverse order index that I created over the _id.time field.

Querying for records based on timestamps should now be fairly quick. However, a lot of the time, I won’t need the full resolution of the data that’s being held in the database. Furthermore, for queries ranging over long periods of time, a 5 second resolution could result in a lot of data. To manage this, we can aggregate sensor values into means over time periods (for example, the mean consumption in an hour). We could do this on the fly if we were feeling particularly insane. Alternatively, we can get the database to calculate these aggregates for us using map reduce operations, which I discuss in the next section.

### Aggregating Data Using Map Reduce Operations

MongoDB provides a couple of methods for aggregating data — the aggregation pipeline and map reduce. The aggregation pipeline is the simpler method of the two but, from what I’ve seen, performs the aggregation operations on the fly. This is fine when aggregating over small sets of data, but I may potentially want to select means over years worth of values (when I get that much data), and I may be making these ‘aggregate queries’ frequently.

Given this, it seems to me that the better option is to use MongoDB’s Map Reduce framework to perform the aggregation.
This is relatively simple: we provide a “map” function, which assigns a set of records to a given key, and then we provide a “reduce” operation, which performs the aggregation of all items with the same key into a single value.

Take, for example, the javascript below which defines a map operation that removes the seconds and milliseconds from the time field of the ‘raw’ (i.e. non-aggregated) records. Consequently, any records that occur within the same minute are assigned to the same key. The “emit” instruction outputs the record specified as its second parameter with the key specified as its first parameter to the map-reduce pipeline.

  var map_minutes = function() {
var key = {
time: new Date(
this._id.time.getFullYear(),
this._id.time.getMonth(),
this._id.time.getDate(),
this._id.time.getHours(),
this._id.time.getMinutes(),
0, 0),
sensor: this._id.sensor,
type: this._id.type
};

emit(key,
{
latestTime: this._id.time,
count: 1,
max: this.value,
min: this.value,
value: this.value
}
);
};


The reduction operation in this instance takes the sum (not the mean) of the values, and also keeps a record of the max and min values of those records assigned to the same key

 var reduce_sensordata = function(key, values) {
var r = {
latestTime: new Date(),
count: 0,
max: 0,
min: 99999,
value: 0
};

values.forEach(function(v) {
if(v.value > r.max) r.max = v.value;
if(v.value < r.min) r.min = v.value;
if(v.latestTime > r.latestTime) r.latestTime = v.latestTime;
r.value += v.value;
r.count += v.count;
});

return r;
};


Finally, a finalize function is used to perform the actual calculation of the mean based on the sum values and counts calculated by the reduction operation:

  var finalize_mean_sensordata = function(key, value) {
if(value.count > 0) {
value.value /= value.count;
}
return value;
};


In addition to the above methods (and a number of other similar methods, map_hours, map_days etc.) I have a helper method which actually kicks off the map reduce operation for me. I’ll dump it all here, but I’ll explain it in more detail in the following paragraph:

 var doMeanSensorMapReduce = function(mapfn, collection) {
var getSince = db.sensordata2[collection].find().sort({"value.latestTime":-1}).limit(1);

var sinceYear = 1900, sinceMonth = 0, sinceDay = 1, sinceHour = 0, sinceMinute = 0;

var since;

// the last time map-reduce was run, the last period may have not been complete.
// start the map-reduce at the beginning of the last period.
if (getSince.count() >  0) {
var lt = getSince[0].value.latestTime

sinceYear = lt.getFullYear();
switch(collection)
{
case 'minutely':
sinceMinute = lt.getMinutes();
case 'hourly':
sinceHour = lt.getHours();
case 'daily':
sinceDay = lt.getDate();
case 'monthly':
sinceMonth = lt.getMonth();
}
}

since = new Date(sinceYear, sinceMonth, sinceDay, sinceHour, sinceMinute, 0,0);

db.sensordata2.mapReduce(
mapfn,
reduce_sensordata,
{
out : { merge: ("sensordata2."+collection) },
query: {"_id.time" : {"$gt" : since} }, finalize: finalize_mean_sensordata, } ); };  The most important operation to note here is the one on lines 30-36. This performs the actual map reduce operation. The function in the script takes a function as a parameter (mapfn), which, for the sake of example, lets say is the map_minutes function described above. This is passed as the first parameter to the mapReduce operation on line 30, and the reduction function described above is passed as the second parameter. The third parameter contains a number of options the fine-tune the map reduce operation. The first of the configuration options that we see is the “out” field, which defines the collection into which the records output by this map-reduce operation will be placed. In the example above, the collection name is actually taken as a parameter. For example, for me, data aggregated over each minute is placed into the “minutely” sub-collection of sensordata2 (so the “collection” parameter here would be “minutely”). Note also, that we’re telling map-reduce to merge the data into the collection. This tells map-reduce that if a record already exists with the same key in the destination collection, that record should be overwritten with the value from the current map-reduce operation. The query field specifies the query that should be used to limit the set of data that is being worked upon. If this isn’t specified, then the map-reduce operation would be performed over the entire database each time it is run. However, in the script above, we set the query to limit the data to the rows that were recorded since the beginning of the last period in which a map-reduce operation had taken place (we choose to do it from the beginning rather than the end as the map-reduce operation might have been performed part-way through that time period, meaning some values might not have been correctly aggregated). Finally, the finalize field specifies our finalize method, which performs the final calculation of our mean based upon the sums calculated during the reduce phase of our map-reduce operation. If this isn’t specified, the records stored into the database would be those returned by the reduce method. This file can be downloaded from my GitHub Gist. That more-or-less covers the database. Now I’ll very quickly describe the PERL script that I wrote (hacked together) to get data into the database from the CurrentCost. After that, I’ll describe the PHP interface I wrote to allow people to query the data, and then I’ll show the javascript that is used to query the PHP interface and render plots based on the returned data. ## A PERL Script to Parse the CurrentCost Data and Insert it into MongoDB As far as sub-headings go, this one probably isn’t the snappiest. However, it describes exactly what I’m going to be showing here. Below is the script that I use to read the XML data described above from my current cost and then to parse and store that data in my MongoDB database. The main points of interest here are lines 45-47, 55-67 and 77-113 and 133-144. I’ll discuss these blocks in more detail after the code. #!/usr/bin/perl -w use strict; use Device::SerialPort qw( :PARAM :STAT 0.07); use XML::Parser; use MongoDB; use MongoDB::MongoClient; use MongoDB::OID; use DateTime; # Beginning of configuration my$PORT = "/dev/CurrentCost";
my $databaseName = "sensordata"; my$collectionName = "sensordata2";
my $dbhost = "localhost"; my$logfile = "/var/log/currentcost";
my $detatch = 0; my$lockfile = "/.currentcost";

# End of configuration

open( LOGFILE, ">>$logfile"); #detatch from the console my$depth = 0;

my $parser = new XML::Parser (Style => 'Tree'); sub log{ my ($message) = @_;
my @time = localtime(time);
my $year = 1900 +$time[5];
print LOGFILE sprintf("[%04u-%02u-%02u %02u:%02u:%02u] %s\n",$year,$time[4],$time[3],$time[2],$time[1],$time[0], $message); } sub printAtDepth{ my ($depth, $string) = @_; for(my$i = 0; $i <$depth; $i ++) { print "\t"; } print$string;
}

my $dbClient = MongoDB::MongoClient->new( host =>$dbhost );
my $db =$dbClient -> get_database( $databaseName ); my$collection = $db -> get_collection($collectionName );

my $watts = 0; my$channel = 0;
my $sensor = 0; my$value = "";

sub addSensorValue{ my ($type,$sensor, $value) = @_; MongoDB::force_double($value);
MongoDB::force_int($sensor); my$time = DateTime -> now;

$collection -> insert( { "_id" => { "time" =>$time,
"sensor" => $sensor, "type" =>$type
},
"value" => $value, } ); } sub addConsumption{ my ($sensor, $channel,$watts) = @_;
&addSensorValue("electricity",10+$channel,$watts);
}

sub addTemperature{ my ($sensor,$degreesc) = @_;
&addSensorValue("temperature",$sensor,$degreesc);
}

sub processTree{ my ($tag,$content) = @_;

if("$tag" eq "sensor") {$sensor = $content->[2]; } elsif($tag =~ m/ch([0-9])/)
{
$channel =$1;
&processTree(@$content[1,2]); &addConsumption($sensor,$channel,$watts);
}
elsif("$tag" eq "watts") {$watts = $content->[2]; } elsif( "$tag" eq "hist")
{
return;
}
elsif( "$tag" eq "tmpr") { &addTemperature(0,$content->[2]);
}
elsif( ref $content ){ my$attributes = $content->[0]; # content is an xml element. for(my$i = 1; $i <$#$content;$i+=2)
{
&processTree(@$content[$i,$i+1]); } } else {$value = $content; } } if( -e$lockfile )
{
print "Lock file $lockfile exists. Delete this before starting.\n"; exit(0); } if($detatch)
{
if(fork())
{
exit(0);
}
&log("Detatched from console.");
}
&log("Sleeping for 10 seconds.");
sleep(10);
&log("Attempting to open socket.");
my $dev = new Device::SerialPort($PORT) or &log("Unable to open socket. $!");$dev->baudrate(57600);
# $dev->write_settings; open(SERIAL, "+>$PORT");

while(my $line = <SERIAL>) { chomp$line;
my $tree =$parser->parse($line); &processTree(@$tree);
}


There are a number of important things to take away from this code.

1. Connecting to the CurrentCost and listening for data
2. Parsing the CurrentCost XML Data
3. Connecting to MongoDB
4. Inserting the CurrentCost Data into the MongoDB Database

I’ll discuss these in order in the following sections.

### Connecting to the CurrentCost and Listening for Data

As far as this is concerned, the script doesn’t differ hugely from the scripts in the “previous work” section above. However, I’ll describe it here for completeness. On my computer, I’m using a UDEV rule in order to create a specific device file for my current cost — /dev/CurrentCost. I’ve done this by creating a file /etc/udev/rules.d/75-currentcost.rules which has the following content:

ATTRS{idVendor}=="067b", ATTRS{idProduct}=="2303", NAME="CurrentCost", MODE="0666", RUN += "/usr/bin/currentcost.pl"


This detects when the current cost has been plugged into the USB socket on my computer, creates a /dev/CurrentCost file with the file permissions 0666. In addition, it runs the /usr/bin/currentcost.pl command, the content of which is the PERL above.

With that, the PERL script begins to execute the following code:

my $dev = new Device::SerialPort($PORT) or &log("Unable to open socket. $!");$dev->baudrate(57600);

open(SERIAL, "+>$PORT"); while(my$line = <SERIAL>)
{
#...
}


where “$PORT” is “/dev/CurrentCost”. It is important to note the baud rate used here: 57600. This is the baud rate that should be used for serial communications with the current cost over USB. It differs for other methods. Once the serial port has been opened for reading, the PERL script begins reading data from the port line by line. For each line, it calls the processTree subroutine, whose job it is to parse the XML from the current cost and then to insert it into the MongoDB database. ### Parsing the CurrentCost XML Data The XML that is received from the CurrentCost is described in a section above. The job of parsing it goes to PERL’s “XML::Parser” library and the processTree subroutine. The$parser->parse($line) call returns the XML in$line as a tree structure, which processTree then works upon:

sub processTree{ my ($tag,$content) = @_;

if("$tag" eq "sensor") {$sensor = $content->[2]; } elsif($tag =~ m/ch([0-9])/)
{
$channel =$1;
&processTree(@$content[1,2]); &addConsumption($sensor,$channel,$watts);
}
elsif("$tag" eq "watts") {$watts = $content->[2]; } elsif( "$tag" eq "hist")
{
return;
}
elsif( "$tag" eq "tmpr") { &addTemperature(0,$content->[2]);
}
elsif( ref $content ){ my$attributes = $content->[0]; # content is an xml element. for(my$i = 1; $i <$#$content;$i+=2)
{
&processTree(@$content[$i,$i+1]); } } else {$value = $content; } }  The code here is fairly straight-forward: process tree expects data to come in {tag, content} pairs (the content may be more tag, content pairs), it looks at the tag and decides what to do with the content based upon that: if the tag is “sensor” then it assigns the$sensor id appropriately; if the tag is “watts” it sets the $watts variable; if the tag is “hist”, this corresponds to historical data the current cost periodically sends out, and we ignore it; if the tag is “tmpr” we immediately add the temperature data to the database; otherwise, if$content is a reference then $content is pointing to an xml tree, which we must process, else we just assign the data to$value. Apparently this seemed reasonable at the time, but I can’t see what I’m actually doing with that data now… I guess we can consider this data to be ignored.

The addConsumption and addTemperature subroutines above both call the addSensorValue subroutine. This uses a connection to MongoDB and inserts the sensor data into the sensordata2 collection described above. This process is discussed in the next section.

### Connecting to MongoDB

Connecting to MongoDB within PERL is incredibly easy: There are a couple of modules you need to include, (see lines 6,7 and 8 in the script above), but after that it’s incredibly simple. Simply make a connection to the MongoDB server, get the appropriate database, then get the collection you’d like to use. This is all done in lines 45,46 and 47 above.

Inserting the data is also incredibly simple, and is described below.

### Inserting the CurrentCost Data into the MongoDB Database

Once you’ve connected to MongoDB and you’ve got your collection object, inserting the data into your collection is as simple as passing a hash to a subroutine. The only slight hiccup for me was the fact that the PERL MongoDB library attempts to infer the type of the data you’re insert, and it is not always 100% accurate.

This can be remedied really easily by using the MongoDB::force_* subroutines, as is shown in the code below:

sub addSensorValue{ my ($type,$sensor, $value) = @_; MongoDB::force_double($value);
MongoDB::force_int($sensor); my$time = DateTime -> now;

$collection -> insert( { "_id" => { "time" =>$time,
"sensor" => $sensor, "type" =>$type
},
"value" => $value, } ); }  This tells MongoDB that$value is a double and $sensor is an integer. Easy. After that you simply create a hash whose structure matches the structure of the json document you’d like to insert into the MongoDB database, and then simply call$collection->insert(that hash);.

In terms of collecting, parsing and storing the data, that’s all there is to it (as far as my current implementation is concerned). In the next blog post, I’ll write about the PHP interface I have to query the data and the javascript I’m using to render the data into plots.

Posted in Data, Programming, Sensors.

## 0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

Some HTML is OK