XML to tables (csv) with NiFi and Groovy Part 1 of 2

XMl is a common format used by many applications, so it will be one of the formats you need to support in your Big Data platform.

 

I have divided up this article in two part

Part 1 Preparation

Part 2 Coding and a complete working setup in NiFi with groovy

All code used in this article can be downloaded from https://github.com/maxbback/nifi-xml/

XML is a structured format that is good for application but not so convenient if you are used to work with SQL and tables so I decided to design and write code for transforming XML to CSV.

 

You might thing that converting from XML to CSV is a very simple thing, but it is not XML is a hierarchical tree structured data format while CSV is a flat column based format.

 

A branch in XML can key value pairs and branches, branches is of type one2one or ont2many

 

When we convert an XML the best is if we have a schema so we know  how the XML file can look like as we then has the description for how the DB table shall be defined.

 

My design is based on that we have a XSD schema as a base to secure that we know which table columns we need.

 

In below schema we have example of many of the patterns we often see in a schema, lets discuss them.

Element person has an attribute name and can exist in any number of instances and needs to be its own table.

Element person has direct element childs;

  • Full_name is a simple element that is of type string and is a column in person table
  • Child_name is a simple element of type string that can exist in 0-5 times and can either be its own table och be 5 columns in person table “child_name1-5”
  • Child_name_complex is unbound and has to be its own table

Element child_name_complex has direct element childs;

  • Length is a simple element of type decimal and is a column in child_name_complex table
  • Age is also a simple element and will also be a column in child_name_complex
  • Mother is a complex type but can only exist 1 time and can then be flattened out to be columns in child_name_complex
<xs:element name="persons">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="person" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="full_name" type="xs:string"/>
            <xs:element name="child_name" type="xs:string" minOccurs="0" maxOccurs="5"/>
            <xs:element name="child_name_complex" minOccurs="0" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="length" type="xs:decimal" minOccurs="0" nillable="true"/>
                  <xs:element name="age" type="xs:decimal" minOccurs="0" nillable="true"/>
                  <xs:element name="mother" minOccurs="0" maxOccurs="1">
                    <xs:complexType>
                      <xs:sequence>
                        <xs:element name="name" type="xs:string" minOccurs="0" nillable="true"/>
                      </xs:sequence>
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Rules for XML to tables

We need simple rules to follow when we are flattening out the XML tree to tables and columns.

 

  1. Element with maxOccurs=”unbound” must be its own table
  2. Element with maxOccurs=”1” can be flattened out to be column in its parent but child elements must be renamed to include the parent name
    1. Example mother_name for mother element name
  3. Attributes is column in its parent
  4. Element of type simple that has a limited maxOccurs can be flattened out to columns, but if the value is high it’s better to generate a new table
  5. Tables of child’s need to be unique this can be accomplished be including parent table name in the table name or make each table unique by adding a suffix to them.
  6. If element that represent a table has 1 attribute we treat this as the primary key
  7. All child table has the parent primary key as a foreign key
  8. If table do not have a natural primary key we set it to current timestamp of execution for each top level table.

 

 

Sometimes XSD schema is very open which will result in many tables, if you know your data and the alternatives of its look, it is better to make a more strict schema which will dramatically lower the number of tables, if you use many  1t1 branches remove them and make them simple elements instead as you then will get shorter column names.

 

Use Cases

Let’s define our expectations of what we need;

  • DDL to create tables
  • Hiarchical diagram in a simple form to show table relations and columns in each table
  • The content of each table after parsing the XML passed on in its own flow file for each table

The basic code;

This code need a bit of a clean up which we will solve in next article, but this will give you the basic concept.

 

It is requiring a XSD file and a XML file, and it then printing out the result with relation to each table.

 

In next article I will provide the groovy script for NiFi and a cleanup of current code.

 

 

class Column {
	def path = []
	def colName
	def name
	def type
	def attribute
}

class Table {
	def path = []
	def name
	def schema
	def columns = []
	def primaryKey = null
	def foreignKey = null
	def foreignKeyPath = []
}
/*
def getNodes = { doc, path ->
	def nodes = doc
	path.split("\\.").each { nodes = nodes."${it}" }
	return nodes
}
*/
def getNodes (doc,path) {
	def node = doc
	path.each
	{
		node=node."${it}"	
	}
	return node
}


def myTables = []

def showElement(myTables,doc_xml,doc_xsd,element,table,path,colNamePrefix)
{
	//println("# ${table.name}")
	def nrAttributes = 0
	def keyAttribute = ""
	element.complexType.attribute.each
	{
		nrAttributes=nrAttributes+1
		//println("@ ${table.name},${it.@name},${it.@type}")
		col = new Column ()
		col.colName = "${colNamePrefix}${it.@name}"
		col.name = it.@name
		col.type = it.@type
		if (path.size() > 0) {
			col.path.addAll(path)
		}
		col.attribute = 1
		table.columns.add(col)
		keyAttribute = it.@name
	}
	if (nrAttributes == 1 && table.primaryKey == null)
	{
		// if only one attribute we treat it as primarykey
		table.primaryKey=keyAttribute
	}

	element.complexType.sequence.element.each
	{
		//println("${table.name},${it.@name}")
		if (!it.@type.isEmpty()) {
			//println("${table.name},${it.@name},${it.@type}")
			col = new Column ()
			col.colName = "${colNamePrefix}${it.@name}"
			col.name = it.@name
			col.type = it.@type
			if (path.size() > 0) {
				col.path.addAll(path)
			}
			col.attribute = 0
			table.columns.add(col)
		} else {
			if (it.@maxOccurs.isEmpty() || it.@maxOccurs == "1") {
				//println ("1to1 Branch ${it.@maxOccurs} ${it.@name} ${table.path} ${path}")	
				def name = it.@name
				// Flatten out this branch
				def newPath = []
				if (path.size() > 0) {
					newPath.addAll(path)
				}
				newPath.add(it.@name)
				showElement(myTables,doc_xml,doc_xsd,it,table,newPath,"${colNamePrefix}${it.@name}_")
				
			} else {
				newtable = new Table()
				newtable.columns = []
				newtable.name = "${table.name}_${it.@name}"
				if (table.path.size() > 0) {
					newtable.path.addAll(table.path)
				}
				newtable.path.add(it.@name)
				if (path.size() > 0) {
					newtable.path.addAll(path)
				}
				if (table.primaryKey != null) {
					newtable.foreignKey = table.primaryKey
					if (table.path.size() > 0) {
						newtable.foreignKeyPath.addAll(table.path)
					}
				}
				else if (table.foreignKey) {
					newtable.foreignKey = table.foreignKey
					if (table.foreignKeyPath.size() > 0) {
						newtable.foreignKeyPath.addAll(table.foreignKeyPath)
					}
				}

				myTables.add(newtable)
				//println ("1toM Branch ${it.@maxOccurs} ${it.@name} ${newtable.path} ${path}")	
				
				showElement(myTables,doc_xml,doc_xsd,it,newtable,[],"")
			}
		}
	}
}
doc_xsd = new XmlSlurper().parse("my.xsd")
doc_xml = new XmlSlurper().parse("my.xml")

def nodeA=doc_xml.item.Application.AnswerCodeFromUC
nodeA=nodeA.".."
println(nodeA)
quit()
def n = 3
n.times {
   println "Hello World ${it}"
}

doc_xsd.element.each
{
	table = new Table()
	table.name = it.@name
	table.columns = []
	table.path = []
	myTables.add(table)
	showElement(myTables,doc_xml,doc_xsd,it,table,[],"")
}
println("## ddl ###############################################")
myTables.each
{
	def p = it.path.join(".")
	println("### table ${it.name} # ${p}")
	def cols = []
	if (it.primaryKey) {
		println("primary key ${it.primaryKey} ${it.foreignKey} ${it.foreignKeyPath.join('.')}")
	}
	it.columns.each
	{
		p = it.path.join(".")
		println("col ${it.name} ${it.type} ${it.attribute} ${p}")
		cols.add(it.name)
	}
}

println("######## xml")
def date = new Date()
def timeStamp = date.getTime()

myTables.each
{
	def tableName = it.path.join('.')
	println("#table ${tableName}")
	def table = it
	def columns = it.columns
	def colArray = ["apk","afk"]
	def pk,fk,n
	if (it.primaryKey) {
		n = getNodes(doc_xml,table.path)
		pk = n.@"${it.primaryKey}"	
	}
	else {
		pk = timeStamp
	}
	def foreignKey = null
	def foreignKeyPath = []
	if (it.foreignKey) {
		n = getNodes(doc_xml,table.foreignKeyPath)
		pk = n.@"${it.primaryKey}"	
	}
	else {
		pk = timeStamp
	}
	it.columns.each {
		def p = []
		if (it.path.size() > 0) {
			p.addAll(it.path)
		}
		p.add(it.name)
		colArray.add(p.join('.'))
	}
	println(colArray.join(','))
	getNodes(doc_xml,table.path).each
	{
		colArray = []

		def record = it
		columns.each
		{
			def node = getNodes(record,it.path)
			def p = it.path.join('.')
			if (it.attribute == 1)
			{
				def v = node.@"${it.name}"
				def val = v.text()
				colArray.add(val)
				//println("attribute ${p} ${it.name}:${it.type}:${val}")
			} else {
				def v = node."${it.name}"
				def val = v.text()
				colArray.add(val)
				//println("col ${p} ${it.name}:${it.type}:${val}")
			}
		}
		println(colArray.join(','))
	}
}

1 Comment

  1. […] the first article http://max.bback.se/index.php/2018/06/10/xml-to-tables-csv-with-nifi-and-groovy-part-1-of-2/ I described the problem with ingesting XML files into relational tables, and my design in how to do […]

Leave a Reply

Your email address will not be published. Required fields are marked *