` Code Snippets - A local file cache for Amazon S3

A utility for importing/exporting between hbase and csv

HBase is a NOSQL distributed database system that runs a cluster of machines on top of Hadoop. HBase and Hadoop are both top-level projects in the Apache Software Foundation and very useful for storing and processing big data.

This snippet, hbaseload, illustrates how to transfer between HBase and CSV files. There are significant differences between HBase and a conventional relational database, and this influences the import and export process. Because components in the hadoop stack are typically written in java, this snippet will also use java.

To run HBaseExporter and HBaseImporter, supply the HBase table name as the first argument and the path to the CSV file as the second argument.

java -classpath <include HBase jars and conf directory> HBaseImporter MyTable /home/me/mydata.csv
        
java -classpath <include HBase jars and conf directory> HBaseExporter MyTable /home/me/mydata.csv
        

HBaseImporter can be used with an extra argument which specifies the name of a CSV field which should hold a unique value for each row in the file. The way this is used is explained further below. HBaseImporter and HBaseExporter could be run on one of the machines in the HBase/Hadoop cluster but would more usually be used on a remote (client) system where Hadoop and HBase have been installed and are used to connect to the cluster rather than as running as services.

hbaseload is divided into two parts. The first part contains code for importing and exporting CSV, by defining two classes called CsvInputStream and CsvOutputStream. These classes share a common super-class, CsvStream.

CsvStream implements common settings for reading and writing CSV data - enabling the delimiter and the file encoding to be configured.

CsvStream.java
// CsvStream.java - set configuration options for CSV format
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

public class CsvStream {
	char delimiter = ',';
	String encoding = "UTF-8";
	
	public void setEncoding(String encoding) {
		this.encoding = encoding;
	}
	
	public void setDelimiter(char delimiter) {
		this.delimiter = delimiter;
	}
}

CsvOutputStream is reasonably straightforwards to implement.

CsvOutputStream.java
// CsvOutputStream.java - write data to a CSV formatted file (ref http://tools.ietf.org/html/rfc4180)
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

import java.io.*;

public class CsvOutputStream extends CsvStream implements Closeable {
	
	OutputStream out;
	BufferedWriter bufferedWriter;

	public CsvOutputStream(OutputStream out) {
		this.out = out;
	}
	
	public void init() throws UnsupportedEncodingException {
		OutputStreamWriter streamWriter = new OutputStreamWriter(out, encoding);
		bufferedWriter = new BufferedWriter(streamWriter);
	}
	
	public static String dumpCsvLine(String []vals, char delimiter) {
		StringBuffer s = new StringBuffer();
		for(int i=0; i<vals.length; i++) {
			if (i>0) {
				s.append(delimiter);
			}
			s.append(dumpCsvValue(vals[i],delimiter));
		}
		return s.toString();
	}
	
	public static String dumpCsvValue(String value, char delimiter) {
		if (value.indexOf(delimiter)>=0 || value.contains("\"") || value.contains("\n") || value.contains("\r")) {
			return "\""+value.replace("\"", "\"\"")+"\"";
		} else {
			return value;
		}
	}
	
	public void writeLine(String[]vals) throws IOException {
		if (bufferedWriter == null) {
			init();
		}
		bufferedWriter.write(dumpCsvLine(vals,delimiter)+"\n");
	}
	
	public void close() throws IOException {
		bufferedWriter.close();
	}
}

CsvInputStream is a little more complicated, to allow for the way that CSV values can contain quoted delimiters and newline characters. The implementation is a little messy and could probably be streamlined. There are probably many other open source implementations.

CsvInputStream.java
// CsvInputStream.java - read data from a CSV formatted file (ref http://tools.ietf.org/html/rfc4180)
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

import java.io.*;
import java.util.ArrayList;
import java.util.List;


public class CsvInputStream extends CsvStream implements Closeable {
	
	InputStream in = null;
	BufferedReader bufferedReader = null;
	
	public CsvInputStream(InputStream in) {
		this.in = in;
	}
	
	private void init() throws IOException {
		InputStreamReader streamReader = new InputStreamReader(in, encoding);
		bufferedReader = new BufferedReader(streamReader);
	}
	
	public static String[] parseCsvLine(char delimiter, BufferedReader reader) throws IOException {
		String line = getLine(reader);
		if (line == null) {
			return null;
		}
		
		List<String> vals = new ArrayList<String>();
		boolean inquotes = false;
		StringBuffer val = new StringBuffer();
		
		do {
			int i = 0;
			while(i<line.length()) {
				char c = line.charAt(i);
				char nc = (i<line.length()-1) ? line.charAt(i+1) : '\0';
				switch(c) {
					case '"':
						if (!inquotes) {
							val.setLength(0);
							inquotes = true;
						} else if (nc == '"') {
							val.append('"');
							i += 1;
						} else {
							inquotes = false;
							while(i < line.length()) {
								if (line.charAt(i) == delimiter) {
									vals.add(val.toString());
									val.setLength(0);
									break;
								}
								i++;
							}
							
						}
						break;
					
					default:
						if (c == delimiter) {
							if (!inquotes) {
								vals.add(val.toString());
								val.setLength(0);
							} else {
								val.append(c);
							}
						} else {
							if ((c != '\n' && c != '\r') || inquotes) {
								val.append(c);
							}
						}
						break;
				}
				i+=1;
			}	
			
			if (inquotes) {
				line = getLine(reader);
				if (line == null) {
					return null;
				}
			} 
		} while(inquotes);
		
		vals.add(val.toString());
		return vals.toArray(new String[0]);
	}
	
	public String[] readLine() throws IOException {
		if (bufferedReader == null) {
			init();
		}
		return parseCsvLine(delimiter,bufferedReader);
	}
	
	private static String getLine(BufferedReader reader) throws IOException {
		StringBuffer sb = new StringBuffer();
		int i;
		while((i=reader.read())>=0) {
			char c = (char)i;
			sb.append(c);
			if (c == '\n') {
				return sb.toString();
			}
		}
		if (sb.length()>0) {
			return sb.toString();
		} else {
			return null; // EOS
		}
	}
	
	public void close() throws IOException {
		if (bufferedReader != null) {
			bufferedReader.close();
		}
	}

}

A brief unit test was developed to cover the CSV handling classes. More comprehensive unit tests are really required here, but the brief test covers some of the edge cases.

CsvUnitTest.java
// CsvUnitTest.js - unit test for CsvInputStream and CsvOutputStream
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.junit.Test;
import org.junit.Assert;

public class CsvUnitTest {
	
	String input = "field1,field2,field3,field4\n"
					+"This,is,a,test\n"
					+"This,is,\"one more\",test\n"
					+"This,is also,\"one\nmore\",test\n"
					+"こんにちは,здравей,\"હેલ્લો\",hello\n"
					+"\"\"\"another\"\"\",\" edge \", case, \"here\" \n";
	
	String[][] output = {{"field1", "field2", "field3", "field4"},
						{"This", "is", "a", "test"},
						{"This", "is", "one more", "test"},
						{"This", "is also", "one\nmore", "test"},
						{ "こんにちは", "здравей", "હેલ્લો","hello" },
						{"\"another\""," edge "," case","here"}};
	
	String cannonical = "field1,field2,field3,field4\n"
			+"This,is,a,test\n"
			+"This,is,one more,test\n"
			+"This,is also,\"one\nmore\",test\n"
			+"こんにちは,здравей,હેલ્લો,hello\n"
			+"\"\"\"another\"\"\", edge , case,here\n";

	@Test
	public void test() throws Exception {
		InputStream sis = new ByteArrayInputStream(input.getBytes("UTF-8"));
		ByteArrayOutputStream sos = new ByteArrayOutputStream();
		
		CsvInputStream cis = new CsvInputStream(sis);
		cis.setEncoding("UTF-8");
		cis.setDelimiter(',');
		
		CsvOutputStream cos = new CsvOutputStream(sos);
		cos.setEncoding("UTF-8");
		cos.setDelimiter(',');
		
		String []row;
		int rowIndex = 0;
		while((row=cis.readLine()) != null) {
			String[] expected = output[rowIndex];
			Assert.assertArrayEquals(expected, row);
			cos.writeLine(row);
			rowIndex++;
		}
		Assert.assertTrue(rowIndex == output.length);
		cis.close();
		cos.close();
		
		Assert.assertEquals(new String(sos.toByteArray(),"UTF-8"),cannonical);
	}

}

Now the CSV related classes introduced above can be used to build the HBase exporter and importer. The remaining classes require the CLASSPATH to include:

HBaseImportExport is a superclass which defines some common code shared between HBaseExporter and HBaseImporter.

HbaseImportExport.java
// HBaseImportExport.java - common definitions for HBase/CSV import/export 
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

public class HBaseImportExport {
	
	public String DEFAULT_COLUMN_FAMILY = "c1";
	
	public class HBaseCol {
		String family;
		String qualifier;
		HBaseCol(String family,String qualifier) {
			this.family = family;
			this.qualifier = qualifier;
		}
	}

}

HBaseExporter is the class which handles the extraction of data from an HBase table into a CSV formatted file. It has the tricky task of working out which fields to create in the CSV file. HBase does not define fixed columns as a relational database does. It does have the concept of rows, with each row associated with a unique key value. However, rows can contain a dynamic number of cells (each cell is addressed by a combination of column family and column qualifier). HBaseExporter selects the fields to be exported to CSV based on only the first row of the table. It is not suitable for use where the table contains sparse data (meaning that there are rows where not all columns are assigned values).

HBaseExporter.java
// HBaseExporter.java - export data from an HBase table to a CSV formatted file 
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.log4j.Logger;

public class HBaseExporter extends HBaseImportExport {
	
	String tableName;
	
	/**
	 * Construct an exporter for an HBase table
	 * 
	 * @param tableName name of the table in HBase
	 */
	HBaseExporter(String tableName) {
		this.tableName = tableName;
	}
	
	/**
	 * export CSV from an HBase table
	 * 
	 * @param csvFilePath CSV file to export
	 * 
	 * @throws IOException
	 */
	public void exportCSV(File csvFilePath) throws IOException {
		Configuration config = HBaseConfiguration.create(); 
		
		HBaseAdmin admin = new HBaseAdmin(config);
		HTableDescriptor desc = admin.getTableDescriptor(tableName.getBytes());
		
		OutputStream os = new FileOutputStream(csvFilePath);
		CsvOutputStream cos = new CsvOutputStream(os);
		cos.setEncoding("UTF-8");
		cos.setDelimiter(',');
		
		HTable table = new HTable(config, tableName);
		
		Scan scan = new Scan();
        ResultScanner resultScanner = table.getScanner(scan);
        Result result;
        
        int counter = 0;
        
        List<HBaseCol> columns = new ArrayList<HBaseCol>(); 
        List<String> values = new ArrayList<String>();
        
        Logger logger = org.apache.log4j.Logger.getLogger(this.getClass());
        
        while((result=resultScanner.next()) != null) {
        	
        	values.clear();
        	
        	if (columns.size() == 0) {
        		for(KeyValue key: result.list()) {
        			String family = new String(key.getFamily(),"UTF-8");
        			String qualifier = new String(key.getQualifier(),"UTF-8");
            		columns.add(new HBaseCol(family,qualifier));
            		values.add(family+":"+qualifier);
            	}
        		cos.writeLine(values.toArray(new String[0]));
        		values.clear();
        	}
        	
        	for(HBaseCol column: columns) {
        		byte[] val = result.getValue(column.family.getBytes("UTF-8"), column.qualifier.getBytes("UTF-8"));
        		if (val != null) {
        			values.add(new String(val,"UTF-8"));
        		} else {
        			values.add("");
        		}
        	}
        	
        	cos.writeLine(values.toArray(new String[0]));
        	counter += 1;
			if (counter % 10000 == 0) {
				logger.info("Exported "+counter+" records");
			}
        }
        cos.close();
	}

	public static void main(String[] args) throws IOException {
		if (args.length != 2) {
			System.out.println("Usage: HBaseExporter <tablename> <csv file path>");
		}
		
		String tableName = args[0];
		String filePath = args[1];
		File csvOutputFile = new File(filePath);
		HBaseExporter exporter = new HBaseExporter(tableName);
		exporter.exportCSV(csvOutputFile);
	}
}

HBaseExporter outputs field names in the first line of the exported CSV file. It combines the column family and column qualifier separated by a colon (:) to produce the CSV field name. Similarly, HBaseImporter interprets the first line of a CSV file as the field names. When working out how to import CSV data into HBase, HBaseImporter attempts to reverse this process when deciding how to map a CSV field name to an HBase column family and qualifier, if it finds a colon in the name.

HBaseImporter.java
// HBaseImporter.java - import data to an HBase table from a CSV formatted file 
// Copyright (c) 2011 Niall McCarroll  
// Distributed under the MIT/X11 License (http://www.mccarroll.net/snippets/license.txt)

package net.mccarroll.hbaseload;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.log4j.Logger;

public class HBaseImporter extends HBaseImportExport {
	
	HBaseAdmin admin;
	Configuration config;
	Set<String> families = new HashSet<String>();
	List<HBaseCol> columns = new ArrayList<HBaseCol>();
	String tableName;
	int keyPosition = -1;
	
	public HBaseImporter(String tableName) {
		this.tableName = tableName;
	}
	
	public void init() throws MasterNotRunningException, ZooKeeperConnectionException {
		config = HBaseConfiguration.create();  
		admin = new HBaseAdmin(config);
	}
	
	private void deleteTable() {
		try {
			admin.disableTable(tableName);    
			admin.deleteTable(tableName);
		} catch(Exception e) {
		}
	}
	
	private void createTable() throws IOException {
		HTableDescriptor desc = new HTableDescriptor(tableName);
		admin.createTable(desc);
		admin.disableTable(tableName);
		for(String family: families) {
			HColumnDescriptor cf1 = new HColumnDescriptor(family);
			admin.addColumn(tableName, cf1);      
		}
		admin.enableTable(tableName); 
	}
	
	private void analyzeHeaders(String []headers, String keyColumn) {
		columns.clear();
		families.clear();
		int col = 0;
		for(String header: headers) {
			String family = DEFAULT_COLUMN_FAMILY;
			String qualifier = header;
			int pos;
			if ((pos = header.indexOf(":")) > 0) {
				family = header.substring(0,pos);
				qualifier = header.substring(pos+1);
			}
			columns.add(new HBaseCol(family,qualifier));
			families.add(family);
			if (header.equals(keyColumn)) {
				keyPosition = col;
			}
			col++;
		}
	}
	
	private void loadData(CsvInputStream cis) throws IOException {
		
		HTable table = new HTable(config,tableName); 
		
		String vals[] = cis.readLine();
		
		Logger logger = org.apache.log4j.Logger.getLogger(this.getClass());
		int counter = 0;
		String rowId = "";
		while(vals != null) {
			if (keyPosition >= 0 && keyPosition < vals.length) {
				rowId = vals[keyPosition];
			} else {
				rowId = "r"+counter;
			}
			Put put = new Put(rowId.getBytes("UTF-8"));
			
			int col = 0;
			for(HBaseCol column: columns) {
				if (col >= vals.length) {
					break;
				}
				put.add(column.family.getBytes("UTF-8"), column.qualifier.getBytes(),vals[col].getBytes()); 
				col += 1;
			}
			table.put(put);
			vals = cis.readLine();
			counter += 1;
			if (counter % 10000 == 0) {
				logger.info("Imported "+counter+" records");
			}
		}
		cis.close();
	}

	/**
	 * import CSV to an HBase table
	 * 
	 * @param tableName name of the table in HBase
	 * @param csvFile a file
	 * 
	 * @throws IOException
	 */
	public void importCSV(File csvFile, String keyColumn) throws IOException {
		init();
		
		FileInputStream fis = new FileInputStream(csvFile);
		CsvInputStream cis = new CsvInputStream(fis);
		
		// read field names from the first line of the csv file
		analyzeHeaders(cis.readLine(),keyColumn);
		
		deleteTable();
		createTable();
		loadData(cis);
		cis.close();
	}
		
	public static void main(String[] args) throws IOException {
		if (args.length < 2 || args.length > 3) {
			System.out.println("Usage: HBaseImporter <tablename> <csv file path> [<key field name>]");
		}
		
		String tableName = args[0];
		File f = new File(args[1]);
		String keyColumn = null;
		if (args.length > 2) {
			keyColumn = args[2];
		}
		HBaseImporter importer = new HBaseImporter(tableName);
		importer.importCSV(f,keyColumn);
	}

}

If HBaseImporter is invoked with the name of a field in the CSV file which contains unique values, this value is used as the row key that HBase requires to uniquely identify each row, otherwise a synthetic key is constructed. Pleae note that HBaseImporter deletes the supplied table name if it already exists in HBase. Be careful when running HBaseImporter!.

Areas for further investigation include studying performance and support for parallel loading of multiple CSV files into the same HBase table. Support for exporting the row key values in HBaseExporter would also be a useful feature to include.


 

Leave a comment

Anti-Spam Check
Comment