HydrOffice news

Stream and TopBlocks: exploring the content of a data file

Feb. 6, 2015, 11:14 p.m., author: gmasetti

Overview

 
This post uses the Huddl-generated driver to access and collect information about the content of a survey data file. As an example data set, we use the same Kongsberg data file described here. Directions for creating a Python module for this data format are provided here. Before using the Huddl module and coding up the statistics collection, we give a brief overview of how Huddl works under the hood.
 

Conceptual and Physical Data Modelling

 
Huddl was developed as a community-specific, format-oriented data description language. These characteristics provide a certain level of simplification since existing data formats are different answers to the same problem: fast storage of data acquired in real-time. All the data format specifications targeted by Huddl have three components which formed the requirements for the abstract conceptual model and physical implementation reported here:
  • The semantics: what a given value collected in the data format actually means (e.g., the unit of measure);
  • The physical description: how the bits and the bytes are stored on disk (e.g., endianness, memory alignment); and
  • The logical structure: what data structures are used to organize the data (e.g., an array).
Huddl uses XML to give a structural description of the contents of a file format (rather than the content of a particular file). Coupling one or more of the proposed descriptive XML schemas with a given hydrographic dataset, as metadata, provides a detailed definition of how the data have actually been stored.
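
To see why the physical description matters, the following minimal sketch decodes the same four bytes with two different byte orders and compares a packed layout with the platform's native (aligned) one. It uses only the standard struct module; the byte values are illustrative and do not come from the example data file.

```python
import struct

# Four raw bytes as they might sit on disk (illustrative values only).
raw = bytes([0x9A, 0x03, 0x00, 0x00])

# The same bytes read as an unsigned 32-bit field with two byte orders.
as_little_endian = struct.unpack("<I", raw)[0]
as_big_endian = struct.unpack(">I", raw)[0]
print("little-endian:", as_little_endian)
print("big-endian:   ", as_big_endian)

# Memory alignment: a u8 followed by a u32 takes 5 bytes when packed,
# while the native layout ("@") may insert padding, depending on the platform.
packed_size = struct.calcsize("<BI")
native_size = struct.calcsize("@BI")
print("packed size:", packed_size, "native size:", native_size)
```

A format description that records byte order and alignment removes this ambiguity for the reader of the file.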
 

Generic Logical Structure

 
The logical structure is defined in the Content branch of a Huddl Format Description (HFD), used to describe the binary data file in a platform-independent way. The Content lists all the releases of a data format (Streams). For a given release of the data format, a Stream contains the overall composition of a data file:
  • A Header which describes initial shared data fields present in all top-level blocks;
  • A TopBlocks list with all the blocks that can be encountered at the top level of the format; and
  • An optional Tail description, representing a common data element found at the end of all top-level blocks (e.g., a checksum).
Thus, it is possible to have in the same document multiple streams that reflect different version releases of the same data format, and hence updates are only required to be incremental (this characteristic makes them smaller, simpler, and faster than would be required for a single-release format description approach).
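
The relationship between Content, Streams and TopBlocks can be pictured with a small data model. This is only a sketch of the concept, not how Huddl stores it: the class names, the block names (hdr, tail, ping_revA, ping_revB, nav) and the identifiers are all hypothetical. It also shows how a later release can be derived incrementally from an earlier one.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class TopBlockRef:
    """A reference to a block that may appear at the top level (hypothetical)."""
    ref_block: str
    alias: str
    identifier: int


@dataclass
class Stream:
    """One release of a data format: header, top-level blocks, optional tail."""
    rev_id: str
    header: str
    discriminator: str
    tail: Optional[str] = None
    top_blocks: Dict[int, TopBlockRef] = field(default_factory=dict)

    def add(self, ref: TopBlockRef) -> None:
        self.top_blocks[ref.identifier] = ref

    def lookup(self, identifier: int) -> Optional[TopBlockRef]:
        return self.top_blocks.get(identifier)


# First release of a made-up format.
rev_a = Stream(rev_id="revA", header="hdr", discriminator="type", tail="tail")
rev_a.add(TopBlockRef("ping_revA", "ping", 0x01))
rev_a.add(TopBlockRef("nav", "navigation", 0x02))

# Second release: copy the first one and override only what changed.
rev_b = replace(rev_a, rev_id="revB", top_blocks=dict(rev_a.top_blocks))
rev_b.add(TopBlockRef("ping_revB", "ping", 0x01))

# The Content holds all the releases side by side.
content = {stream.rev_id: stream for stream in (rev_a, rev_b)}
print(content["revB"].lookup(0x01).ref_block)
```

The value read from the discriminator field of the header selects the TopBlock to decode, which is why the blocks are keyed by identifier here.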
 

Kongsberg-specific Structure

 

The Kongsberg EM Series format has a structural representation similar to that of many other commonly used hydrographic data formats (eXtended Triton Format, Generic Sensor Format, etc.). At a high level of abstraction, a Stream is just a container for asynchronous TopBlocks.

A TopBlock is called a datagram in the Kongsberg EM Series format specification. In general, all datagrams start with an identification byte (STX, equal to 0x02), the datagram type and a time tag, and end with another identification byte (ETX, equal to 0x03) and a checksum (the sum of the bytes between STX and ETX). In addition, the total length of the datagram (not including the length field itself) precedes the STX byte as a four-byte binary number. This information is carried by two Blocks, as shown in the following Huddl XML snippets:

    <huddl:block name="emhdr">
        <huddl:field name="datasize" type="huddl:u32" minValue="100" maxValue="4100"/>
        <huddl:field name="stx" type="huddl:u8" minValue="2" maxValue="2"/>
        <huddl:field name="type" type="huddl:u8"/>
        <huddl:field name="modelnum" type="huddl:u16"/>
        <huddl:field name="date" type="huddl:u32"/>
        <huddl:field name="time" type="huddl:u32" minValue="0" maxValue="86399999"/>
    </huddl:block>

    <huddl:block name="emtail">
        <huddl:field name="etx" type="huddl:u8" minValue="3" maxValue="3"/>
        <huddl:field name="checksum" type="huddl:u16"/>
    </huddl:block>
The above Blocks are used as head and as tail, respectively, for each of the TopBlocks. Each TopBlock can be encountered multiple times at the Stream's top level. For instance, the following is the XML snippet describing the Stream for release R of the format:
    <huddl:stream revID="revR" scope="rev19" resynch="1024" reclen="1048576">
        <huddl:header refBlock="emhdr" discriminator="type"/>
        <huddl:topBlocks>
            <huddl:topBlock refBlock="em96procunit_revQ" alias="pu_broadcastQ" identifier="0x30"/>
            <huddl:topBlock refBlock="em96heartbeat_revR" alias="pu_statusR" identifier="0x31"/>
            <huddl:topBlock refBlock="em96instparam_revR" alias="data_out_onR" identifier="0x32"/>
            <huddl:topBlock refBlock="em96extraparam_revQ" alias="extraparams" identifier="0x33"/>
            <huddl:topBlock refBlock="em96attitude_revA" alias="attitude" identifier="0x41"/>
            <huddl:topBlock refBlock="em96bist" alias="bist" identifier="0x42"/>
            <huddl:topBlock refBlock="em96clock" alias="clock" identifier="0x43"/>
            <huddl:topBlock refBlock="em96depth_revA" alias="depth" identifier="0x44"/>
            <huddl:topBlock refBlock="em96sbeam" alias="singlebeam" identifier="0x45"/>
            <huddl:topBlock refBlock="em96rawrt" alias="rawrt" identifier="0x46"/>
            <huddl:topBlock refBlock="em96surfacess" alias="surfss" identifier="0x47"/>
            <huddl:topBlock refBlock="em96heading" alias="heading" identifier="0x48"/>
            <huddl:topBlock refBlock="em96instparam_revR" alias="start_iparamR" identifier="0x49"/>
            <huddl:topBlock refBlock="em96mechtilt" alias="mechtilt" identifier="0x4A"/>
            <huddl:topBlock refBlock="em96cenbeam" alias="echograms" identifier="0x4B"/>
            <huddl:topBlock refBlock="em96rawrange78_revQ" alias="rawrange78Q" identifier="0x4E"/>
            <huddl:topBlock refBlock="em96quality" alias="quality" identifier="0x4F"/>
            <huddl:topBlock refBlock="em96position_revG" alias="position" identifier="0x50"/>
            <huddl:topBlock refBlock="em96runtime_revQ" alias="runtimeQ" identifier="0x52"/>
            <huddl:topBlock refBlock="em96seabed_revF" alias="imageryF" identifier="0x53"/>
            <huddl:topBlock refBlock="em96tide" alias="tide" identifier="0x54"/>
            <huddl:topBlock refBlock="em96hrsvp" alias="hrsvp" identifier="0x55"/>
            <huddl:topBlock refBlock="em96svp" alias="svp" identifier="0x56"/>
            <huddl:topBlock refBlock="em96sspout_revQ" alias="sspout" identifier="0x57"/>
            <huddl:topBlock refBlock="em96xyz88_revR" alias="xyzR" identifier="0x58"/>
            <huddl:topBlock refBlock="em96seabed89" alias="backscatter" identifier="0x59"/>
            <huddl:topBlock refBlock="em96range" alias="range" identifier="0x65"/>
            <huddl:topBlock refBlock="em96rawrange" alias="rawrange" identifier="0x66"/>
            <huddl:topBlock refBlock="em96height" alias="height" identifier="0x68"/>
            <huddl:topBlock refBlock="em96instparam_revR" alias="stop_iparamR" identifier="0x69"/>
            <huddl:topBlock refBlock="em96watercolumn_revQ" alias="watercolumnQ" identifier="0x6B"/>
            <huddl:topBlock refBlock="em96netatt_revJ" alias="netattitude" identifier="0x6E"/>
            <huddl:topBlock refBlock="em96instparam_revR" alias="remote_iparamR" identifier="0x70"/>
            <huddl:topBlock refBlock="em96isvp" alias="isvp" identifier="0x76"/>
        </huddl:topBlocks>
        <huddl:tail refBlock="emtail"/>
    </huddl:stream>
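
To make the emhdr and emtail layouts concrete, here is a minimal sketch that builds a synthetic datagram and decodes it with the standard struct module, independently of the Huddl-generated driver. Several points are assumptions made for illustration: little-endian byte order, a checksum taken as the byte sum modulo 65536 over the bytes strictly between STX and ETX, and a zero-filled body. The helper names build_datagram and parse_datagram are hypothetical.

```python
import struct

EMHDR = struct.Struct("<IBBHII")  # datasize, stx, type, modelnum, date, time
EMTAIL = struct.Struct("<BH")     # etx, checksum
STX, ETX = 0x02, 0x03


def build_datagram(dg_type, modelnum, date, time_ms, body):
    """Assemble a synthetic datagram (assumed little-endian layout)."""
    after_stx = struct.pack("<BHII", dg_type, modelnum, date, time_ms) + body
    checksum = sum(after_stx) & 0xFFFF             # assumed: sum modulo 2**16
    datasize = 1 + len(after_stx) + EMTAIL.size    # everything after the length field
    return struct.pack("<IB", datasize, STX) + after_stx + EMTAIL.pack(ETX, checksum)


def parse_datagram(buf):
    """Decode header and tail, and check STX, ETX and checksum."""
    datasize, stx, dg_type, modelnum, date, time_ms = EMHDR.unpack_from(buf, 0)
    end = 4 + datasize                             # the length word is not counted
    etx, checksum = EMTAIL.unpack_from(buf, end - EMTAIL.size)
    expected = sum(buf[5:end - EMTAIL.size]) & 0xFFFF
    valid = (stx == STX) and (etx == ETX) and (checksum == expected)
    return {"datasize": datasize, "type": dg_type, "modelnum": modelnum,
            "date": date, "time": time_ms, "valid": valid}


datagram = build_datagram(0x49, 2040, 20120606, 64758074, bytes(90))
info = parse_datagram(datagram)
print(hex(info["type"]), info["modelnum"], info["datasize"], info["valid"])

# Flipping one body byte should make the checksum test fail.
corrupted = bytearray(datagram)
corrupted[20] ^= 0xFF
print(parse_datagram(bytes(corrupted))["valid"])
```

The header occupies 16 bytes and the tail 3 bytes, so a datagram with a 90-byte body has a datasize of 105 and takes 109 bytes on disk once the length word is included.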
 

Python script

 
The Huddl-based Python module is used to access, to recognize and to count all the datagrams present in the data file. Incidentally, reaching the end of the file without error in the parser also represents a first rough validation for the data content. The script uses several commonly used modules such as matplotlib for plotting utilities and some standard Python libraries such as datetime and operator.
In [1]:
import kongsberg_em_series as kng
import os.path
import datetime
import operator
import matplotlib
from matplotlib import pyplot as plt
from matplotlib import rcParams
from matplotlib import ticker
 
Some helper functions are used to manage the Kongsberg time tags:
In [2]:
def fix_date(date):
    """ convert Kongsberg date """
    day = date % 100
    month = date // 100  # integer division, so the values stay integers in Python 3
    year = month // 100
    date = str("%.4d/%.2d/%.2d" %(year % 10000, month % 100, day % 100))
    return date

def fix_time(time):
    """ convert Kongsberg time """
    msec = time % 1000
    sec = time // 1000  # integer division, so the values stay integers in Python 3
    mins = sec // 60
    hr = mins // 60
    time = str("%.2d:%.2d:%.2d.%.3d" %( hr % 24, mins % 60, sec % 60, msec))
    return time

def make_datetime(date, time):
    """ convert Kongsberg date and time to Python datetime """
    time_string = fix_date(date) + " "+ fix_time(time) + "Z"
    dt = datetime.datetime.strptime(time_string, "%Y/%m/%d %H:%M:%S.%fZ")
    return dt
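
As an alternative to going through a formatted string, the same conversion can be sketched with divmod and a timedelta. This assumes, consistently with the helpers above, that the date is encoded as year*10000 + month*100 + day and the time as milliseconds since midnight. The function name kng_datetime and the input values are illustrative; the inputs were chosen to reproduce the start time of the example file.

```python
import datetime


def kng_datetime(date, time_ms):
    """Convert a Kongsberg-style date and time-of-day to a Python datetime."""
    year, rest = divmod(date, 10000)   # e.g., 20120606 -> 2012, 606
    month, day = divmod(rest, 100)     # e.g., 606 -> 6, 6
    midnight = datetime.datetime(year, month, day)
    return midnight + datetime.timedelta(milliseconds=time_ms)


print(kng_datetime(20120606, 64758074))  # 2012-06-06 17:59:18.074000
```

Working with integers all the way avoids any dependency on how floats are truncated during string formatting.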
 
The reading mechanism to access the data file content is quite simple, and it has already been presented here. We first open the raw file, then we check that it has been opened successfully:
In [3]:
# Open the data file
testfile_path = "C:/.../surveyline.all"
if not os.path.isfile(testfile_path):
    raise IOError("the file %s does not exist" % testfile_path)
file_size = os.path.getsize(testfile_path)
ip = kng.fopen(testfile_path, "rb")
if ip is None:
    raise IOError("data file NOT successfully open")
print("> data file successfully open")
position = kng.ftell(ip)
print("> initial position: ", position)
 
> data file successfully open
> initial position:  0
 
We declare here some variables that will be populated in the next code snippets:
In [4]:
# counters, block dictionary and other information
datagram_dict = dict()
model_num = None
start_time = None
end_time = None
 
To start with a basic example, we read just the first TopBlock (or datagram, as it is called in the Kongsberg specifications):
In [5]:
# Read the first datagram
n_read = 0
datagram_nr = 0
data = kng.kng_revR_t()
rc, n_read = kng.read_kng_revR(ip, data, n_read, kng.validate_emhdr)
print("> read bytes after 1st datagram: ", n_read)
if rc == kng.HUDDLER_TOPBLOCK_OK:
    print("> content: #", datagram_nr, "\t->\tbytes: ", n_read, "\ttype: ", data.header.type, "[", hex(data.header.type), "]")
    model_num = data.header.modelnum
    start_time = make_datetime(data.header.date, data.header.time)
else:
    raise IOError("error in reading the 1st datagram: %s" % rc)
kng.clean_kng_revR(data)
position = kng.ftell(ip)
 
> read bytes after 1st datagram:  926
> content: # 0 	->	bytes:  926 	type:  73 [ 0x49 ]
 
The first TopBlock has been read successfully, and the output provides some basic information, such as its type (0x49, thus a Start Installation Parameters datagram) and its size (926 bytes). Having shown how to read a single TopBlock, we now introduce a loop that goes through all the datagrams in the file, identifying their types and counting them with the previously declared dictionary:
In [6]:
# Loop to read all the datagrams
abort = False
resync_event = 0
kng.fseek(ip, 0, 0)  # go back to the beginning of the data file
position = kng.ftell(ip)  # keep the recovery offset in sync with the rewind
print("> looping through the whole data file: ", n_read)
while not abort:
    rc, n_read = kng.read_kng_revR(ip, data, n_read, kng.validate_emhdr)

    if rc == kng.HUDDLER_TOPBLOCK_OK:
        if datagram_nr < 5: # to print only the first blocks
            print("  #", datagram_nr, "\t->\tbytes: ", n_read, "\ttype: ", data.header.type, "[", hex(data.header.type), "]")
        datagram_nr += 1
        datagram_dict[data.header.type] = datagram_dict.get(data.header.type, 0) + 1
        if n_read != (data.header.datasize + 4):
            print("> #", datagram_nr," -> repositioning")
            kng.fseek(ip, position, 0)                  # Return to position before read
            kng.fseek(ip, data.header.datasize + 4, 1)  # Skip size + 4 (the length word in the datagram)

    elif rc == kng.HUDDLER_TOPBLOCK_UNKNOWN:
        print("> #", datagram_nr, " -> UNKNOWN datagram ", hex(data.header.type), " at offset ", position)
        datagram_nr += 1
        kng.fseek(ip, position, 0)                  # Return to position before read
        kng.fseek(ip, data.header.datasize + 4, 1)  # Skip size + 4 (the length word in the datagram)

    elif rc == kng.HUDDLER_NO_HEADER:
        print("> #", datagram_nr, " -> no header at offset ", position, " -> resynch #", resync_event)
        resync_event += 1

    elif rc == kng.HUDDLER_TOPBLOCK_READFAIL:
        print("> #", datagram_nr, " -> fail reading datagram at offset ", position, " -> resynch #", resync_event)
        resync_event += 1

    elif rc == kng.HUDDLER_INVALID_TAIL:
        print("> #", datagram_nr, " -> invalid tail at offset ", position, " -> resynch #", resync_event)
        resync_event += 1

    elif rc == kng.HUDDLER_EOF_WHILE:
        end_time = make_datetime(data.header.date, data.header.time)
        print("  . . .\n"
              "  #", datagram_nr, "->\tend of file at offset ", position)
        abort = True

    elif rc == kng.HUDDLER_READ_ERROR:
        print("> #", datagram_nr, " -> unexpected read error at offset ", position, " -> aborting")
        abort = True

    else:
        print("> unknown read return: ", rc)
        abort = True

    kng.clean_kng_revR(data)
    position = kng.ftell(ip)

# Closing the data source
print("> last position read in file: ", position, " [file size: ", file_size, "]")
if position == file_size:
    print("> the whole data file has been traversed")
else:
    print("> WARNING: the whole data file has NOT been traversed !!")
ret = kng.fclose(ip)
if ret == 0:
    print("> data file closed")
 
> looping through the whole data file:  926
  # 0 	->	bytes:  926 	type:  73 [ 0x49 ]
  # 1 	->	bytes:  56 	type:  82 [ 0x52 ]
  # 2 	->	bytes:  56 	type:  82 [ 0x52 ]
  # 3 	->	bytes:  56 	type:  82 [ 0x52 ]
  # 4 	->	bytes:  1012 	type:  85 [ 0x55 ]
  . . .
  # 24127 ->	end of file at offset  139303668
> last position read in file:  139303668  [file size:  139303668 ]
> the whole data file has been traversed
> data file closed
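
The skipping rule used in the loop (each datagram occupies datasize + 4 bytes, because the length word does not count itself) can be tried out without the Huddl module. The sketch below walks an in-memory buffer of fake datagrams and counts their types. The helpers fake_datagram and count_types are hypothetical, the byte order is assumed to be little-endian, and the fake datagrams carry only the length word, STX, the type byte and some filler.

```python
import struct
from collections import Counter


def fake_datagram(dg_type, filler_len):
    """Hypothetical stand-in: length word, STX, type byte, then zero filler."""
    payload = bytes([0x02, dg_type]) + bytes(filler_len)
    return struct.pack("<I", len(payload)) + payload


def count_types(buf):
    """Walk the buffer datagram by datagram and count the type bytes."""
    counts = Counter()
    position = 0
    while position + 6 <= len(buf):
        datasize, stx, dg_type = struct.unpack_from("<IBB", buf, position)
        if stx != 0x02:
            break  # lost synchronization: a real reader would try to resynchronize
        counts[dg_type] += 1
        position += datasize + 4  # skip the datagram plus its length word
    return counts, position


buf = (fake_datagram(0x49, 20) + fake_datagram(0x52, 10) * 3
       + fake_datagram(0x55, 30))
counts, last_position = count_types(buf)
for dg_type, n in counts.most_common():
    print(hex(dg_type), n)
print("whole buffer traversed:", last_position == len(buf))
```

As in the notebook, ending exactly at the buffer size is a rough sign that no datagram was skipped or truncated.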
 
All the TopBlocks in the data file have been visited (we limited the printout to just the first 5 datagrams). As a double check, we compare the file size with the last position read in the file, and they match. All the collected information (sonar model, start and end time of data storage, TopBlock types and counts) is then displayed:
In [7]:
# Provide some info about the data file
print("\n> data file content:")
print("  . model number:\t", model_num)
print("  . start time:  \t", start_time)
print("  . end time:    \t", end_time)
print("  . time length: \t", end_time-start_time, "[h:mm:ss.s]")
print("  . total blocks:\t", datagram_nr)
print("  . blocks list:")
print("  [type] --> [counter]")
for d in reversed(sorted(datagram_dict, key=datagram_dict.get)):
    print("   #%s --> %s" % (hex(d), datagram_dict[d]))
print("")
 

> data file content:
  . model number:	 2040
  . start time:  	 2012-06-06 17:59:18.074000
  . end time:    	 2012-06-06 18:11:03.612000
  . time length: 	 0:11:45.538000 [h:mm:ss.s]
  . total blocks:	 24127
  . blocks list:
  [type] --> [counter]
   #0x6e --> 7057
   #0x59 --> 4483
   #0x58 --> 4483
   #0x4e --> 4483
   #0x31 --> 709
   #0x50 --> 706
   #0x68 --> 705
   #0x43 --> 705
   #0x41 --> 698
   #0x47 --> 80
   #0x52 --> 15
   #0x69 --> 1
   #0x55 --> 1
   #0x49 --> 1
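
The counts above become easier to read once each identifier is paired with its alias from the Stream description and converted to a rate. The following sketch hard-codes the counts and the time length printed above (it does not read the file), so it is only a worked check of the arithmetic.

```python
import datetime

# Counts copied from the output above, keyed by datagram identifier.
counts = {0x6E: 7057, 0x59: 4483, 0x58: 4483, 0x4E: 4483, 0x31: 709,
          0x50: 706, 0x68: 705, 0x43: 705, 0x41: 698, 0x47: 80,
          0x52: 15, 0x69: 1, 0x55: 1, 0x49: 1}

# Aliases copied from the topBlocks list of the revR Stream.
aliases = {0x6E: "netattitude", 0x59: "backscatter", 0x58: "xyzR",
           0x4E: "rawrange78Q", 0x31: "pu_statusR", 0x50: "position",
           0x68: "height", 0x43: "clock", 0x41: "attitude", 0x47: "surfss",
           0x52: "runtimeQ", 0x69: "stop_iparamR", 0x55: "hrsvp",
           0x49: "start_iparamR"}

# Time length of the file as printed above: 0:11:45.538
duration = datetime.timedelta(minutes=11, seconds=45,
                              milliseconds=538).total_seconds()
total = sum(counts.values())

for ident, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print("%-14s 0x%02x %6d %5.1f%% %6.2f/s"
          % (aliases[ident], ident, n, 100.0 * n / total, n / duration))
print("total:", total)
```

The per-type counts add up to the 24127 blocks reported by the loop. The clock datagrams arrive at about one per second and the network attitude datagrams at about ten per second.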
 
Finally, the TopBlock counts are plotted using a bar chart:
In [8]:
%matplotlib inline

# Plot the blocks list as bar chart ordered from the most common
keys, values = zip(*sorted(datagram_dict.items(), key=operator.itemgetter(1), reverse=True))
fig, ax0 = plt.subplots(ncols=1, nrows=1, figsize=(10, 10), dpi=300)
matplotlib.rcParams.update({'font.size': 11, 'text.usetex': True, 'font.family': 'serif'})
bar = ax0.bar(range(len(values)), values, align='center', color='#1e90ff')
# attach some text labels
for rect in bar:
    height = rect.get_height()
    ax0.text(rect.get_x()+rect.get_width()/2., height + 100, '%d'%int(height), ha='center', va='bottom', color='#a0522d')
ax0.set_title('$TopBlocks$  $List$')
ax0.tick_params(axis='y', colors='#a0522d')
ax0.set_ylabel('$count$', color='#a0522d')
ax0.set_xlabel('$block$ $id[hex]$', color='#1e90ff')
ax0.set_xticks(range(len(keys)))
ax0.set_xticklabels([hex(lab) for lab in keys], color='#1e90ff')
fig.autofmt_xdate()
plt.show()

print("Done")
 
 
Done

 

tags: notebook; huddl; stream; topblock