[VFS] Implementing custom hdfs file system using commons-vfs 2.0

[VFS] Implementing custom hdfs file system using commons-vfs 2.0

Richards Peter
Hi,

I am evaluating commons-vfs 2.0 for one of my use cases. I read that
commons-vfs 2.1 has a file system implementation for HDFS. Since
commons-vfs 2.1 is still in development and does not yet have all the
capabilities that we require for HDFS, I would like to implement a
custom file system with commons-vfs 2.0 now and enhance commons-vfs 2.1
when that release is made.

Could you please tell me how to implement such a file system for
commons-vfs 2.0? I would like to know:
1. The specific classes that need to be implemented.
2. How to register/supply these classes so that they can be used by my
application.
3. How name resolution takes place when I provide the file path of an
HDFS file.

Thanks,
Richards Peter.
Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
Hello Peter,

I would suggest that you use the current version from SVN or the snapshot
builds. This would have the big advantage that you can actually test and
contribute to this version in case you miss some features or find some
bugs.

If you want to implement your own file system provider, you typically
start by copying one of the existing providers and adapting it. The main
work is in implementing a specific FileObject which extends
AbstractFileObject and implements the various doSomething()
methods.

Actually, the JavaDoc of that abstract class is quite good in this
regard.
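
To give you a rough idea, a skeleton could look like the following. This
is an untested sketch: the class and field names are made up, the Hadoop
calls are written from memory, and the exact constructor signature may
differ slightly between VFS versions.

import java.io.FileNotFoundException;
import java.io.InputStream;
import org.apache.commons.vfs2.FileType;
import org.apache.commons.vfs2.provider.AbstractFileName;
import org.apache.commons.vfs2.provider.AbstractFileObject;
import org.apache.commons.vfs2.provider.AbstractFileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileObject extends AbstractFileObject
{
    private final FileSystem hdfs; // the Hadoop file system client
    private FileStatus stat;       // cached status, null if the file does not exist

    protected HdfsFileObject(final AbstractFileName name,
                             final AbstractFileSystem fs,
                             final FileSystem hdfs)
    {
        super(name, fs);
        this.hdfs = hdfs;
    }

    @Override
    protected void doAttach() throws Exception
    {
        try
        {
            // look up the HDFS status once when the object is attached
            stat = hdfs.getFileStatus(new Path(getName().getPath()));
        }
        catch (final FileNotFoundException e)
        {
            stat = null;
        }
    }

    @Override
    protected FileType doGetType() throws Exception
    {
        if (stat == null)
        {
            return FileType.IMAGINARY;
        }
        return stat.isDir() ? FileType.FOLDER : FileType.FILE;
    }

    @Override
    protected String[] doListChildren() throws Exception
    {
        final FileStatus[] children = hdfs.listStatus(new Path(getName().getPath()));
        final String[] names = new String[children.length];
        for (int i = 0; i < children.length; i++)
        {
            names[i] = children[i].getPath().getName();
        }
        return names;
    }

    @Override
    protected long doGetContentSize() throws Exception
    {
        return stat.getLen();
    }

    @Override
    protected InputStream doGetInputStream() throws Exception
    {
        return hdfs.open(new Path(getName().getPath()));
    }
}

You will also need a small FileSystem and FileProvider class that create
these objects, but those are mostly boilerplate you can copy from an
existing provider.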

After you have implemented the new file system, it will be available for
addProvider(), or you can add it as a new provider to the XML
configuration of StandardFileSystemManager as described here:
http://commons.apache.org/proper/commons-vfs/api.html
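
A minimal sketch of the programmatic variant (the scheme, class name and
host are just examples):

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.impl.DefaultFileSystemManager;

DefaultFileSystemManager manager = new DefaultFileSystemManager();
// register the custom provider for the "hdfs" scheme
manager.addProvider("hdfs", new HdfsFileProvider());
manager.init();

// resolution of an hdfs URI is then routed to your provider
FileObject file = manager.resolveFile("hdfs://namenode:8020/some/path");

For the declarative variant you add a <provider class-name="..."> element
with a nested <scheme name="hdfs"/> to the providers.xml read by
StandardFileSystemManager.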

Greetings
Bernd

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
Hello,

yes, by default VFS offers an InputStream/OutputStream based interface
via FileContent, and a random access interface (which is specific to VFS).

I think the current HDFS provider (VFS 2.1) supports only those two
(read-only for the random access).
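
For illustration, reading an HDFS file through VFS would look roughly
like this (the host and path are made up):

import java.io.InputStream;
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.RandomAccessContent;
import org.apache.commons.vfs2.VFS;
import org.apache.commons.vfs2.util.RandomAccessMode;

FileSystemManager fsManager = VFS.getManager();
FileObject file = fsManager.resolveFile("hdfs://namenode:8020/vfs-test/text.txt");

// stream based access
InputStream in = file.getContent().getInputStream();
// ... read ...
in.close();

// VFS specific random access (the HDFS provider only supports READ)
RandomAccessContent rac = file.getContent().getRandomAccessContent(RandomAccessMode.READ);
rac.seek(0);
rac.close();

// release the file object when done
file.close();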

I am not sure whether you can wrap one of the two into an RCFile, or
whether that only works with real HDFS FileSystem objects (I am not
familiar with Hadoop).

There is also the possibility to add extensions (operations). One
possible extension would be to retrieve the underlying HDFS file (or an
object implementing the record-based interface); see the sketch below.
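
Very roughly, and purely as a sketch (the operation interface and its
name are invented here, only the generic operations API itself is real),
that could look like:

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemException;
import org.apache.commons.vfs2.operations.FileOperation;
import org.apache.hadoop.fs.FileStatus;

// hypothetical operation a future HDFS provider could expose
public interface GetHdfsFileStatus extends FileOperation
{
    FileStatus getFileStatus();
}

// caller side: ask the FileObject whether the operation is available
public final class HdfsOperationExample
{
    public static FileStatus statusOf(final FileObject file) throws FileSystemException
    {
        if (!file.getFileOperations().hasOperation(GetHdfsFileStatus.class))
        {
            return null; // the provider does not offer this extension
        }
        final GetHdfsFileStatus op =
            (GetHdfsFileStatus) file.getFileOperations().getOperation(GetHdfsFileStatus.class);
        op.process();
        return op.getFileStatus();
    }
}

The provider would register an implementation of such an operation
through a FileOperationProvider (see
DefaultFileSystemManager.addOperationProvider()).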

That is certainly the way to go if you need that kind of access.
However, if you want such specific HDFS access modes, I wonder whether it
wouldn't be best to use HDFS directly. What is the motivation for
wrapping it in VFS?

BTW: there was some interest in VFS on the HDFS developer mailing list a
few weeks back. If you plan to do anything in that direction, you might
involve them as well.

I am copying the commons-dev list, since I am not familiar with the HDFS
provider and this is also a general discussion.

Greetings
Bernd


On Mon, 28 Jul 2014 15:57:57 +0530, Richards Peter
<[hidden email]> wrote:

> Hi Bernd,
>
> I would like to clarify one more doubt. I found that commons-vfs is
> implemented based on java.io.*. Commons-vfs returns
> java.io.InputStream/java.io.OutputStream for reading/writing files.
>
> I have a use case to read/write files from/to HDFS. These files may be
> txt (CSV) or RCFiles (Record Columnar Files, using Hive APIs). Handling
> txt files is straightforward: I can wrap the InputStream and
> OutputStream in some reader/writer and process the contents.
>
> However, for RCFiles I have to use:
> https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/ql/io/RCFile.Writer.html
> https://hive.apache.org/javadocs/r0.10.0/api/org/apache/hadoop/hive/ql/io/RCFile.Reader.html
>
> In these classes, the methods exposed to write and read contents are
> not based on Java input and output streams, but on the append() and
> getCurrentRow() APIs, both of which require BytesRefArrayWritable
> objects.
>
> I think my use case is more related to the file content format, reader
> and writer. What would you recommend in this scenario for reading from
> and writing to such files? Should I just hold the FileObject
> implementation reference in my own reader and writer classes and
> create RCFile reader and writer instances within those classes? Can
> something else be done using commons-vfs to read from and write to
> files irrespective of the contents (e.g. FileContent, FileContentInfo
> and FileContentInfoFactory)?
>
> Thanks,
> Richards Peter.
>
>
> On Mon, Jul 28, 2014 at 12:36 PM, Richards Peter
> <[hidden email]> wrote:
>
> > Hi Bernd,
> >
> > Thanks for your response.
> >
> > Our company does not allow the development team to use
> > candidate/snapshot releases of open-source projects. That is the
> > reason why I am looking at the VFS 2.0 version.
> >
> > I am checking the code available in:
> >
> > http://svn.apache.org/viewvc/commons/proper/vfs/trunk/core/src/main/java/org/apache/commons/vfs2/provider/hdfs/
> > and
> >
> > https://github.com/pentaho/pentaho-hdfs-vfs/tree/master/src/org/pentaho/hdfs/vfs
> >
> > I would also like to check whether it is fine if I clarify my
> > doubts with you through this mail thread should I face any problems
> > while implementing an HDFS file system for VFS 2.0. I will also check
> > VFS 2.1 and see whether I can contribute to that as well.
> >
> > Regards,
> > Richards Peter.
> >

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

dlmarion

The HDFS file system implementation for VFS 2.1 is read-only. I did not implement any of the write methods because I didn't need them at the time, and there are some differences between writing to Hadoop and to a regular file system, as you pointed out. If you want to use released HDFS VFS FileObjects, you can use the ones I put into Accumulo (take a look at the 1.6.0 release). The objects in Accumulo will be removed when VFS 2.1 is released.

- Dave

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
Hello Dave,

for the download page (staged:
http://people.apache.org/~ecki/commons-vfs/download.html) I need a list
of the libraries needed to run VFS to access HDFS. Could you maybe
produce an example VFS Shell session similar to

https://wiki.apache.org/commons/VfsReleaseState

where you list the command line needed? It would be best if this used a
public HDFS instance (but I guess there is no such thing?).

BTW: I asked in the VFS-530 bug how commonplace the different HDFS APIs
are, and whether it is really a good idea to bump the minimum version.
Will an HDFS client built against 1.1.2 be able to communicate with
2.6 instances? If yes, I would prefer to keep it at 1.1.2 in this
release. WDYT?

Regards
Bernd

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

dlmarion
Bernd,

Wanted to get back to you right away. I should be able to get the information to you in a day or so (hopefully tonight). I don't know of any public HDFS instances, but I will see what I can find. Regarding VFS-530, I just wanted to bump the version to the latest for the 2.1 release. I will have to ask on the hdfs-dev list whether Hadoop 1.1.2 and 2.6.0 are API compatible.

Dave

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

dlmarion
In reply to this post by Bernd Eckenfels
Bernd,

To answer your question about the HDFS version, I went back and looked at the patches I supplied for VFS-530. In the patches that bump the Hadoop version to 2.4.0 and 2.6.0, the only changes were in the POM. This should mean that the parts of the Hadoop client API that I use are compatible across those versions.

For the example, the list of jars depends on the Hadoop version. I think one way to handle that is to use the `hadoop classpath` command to create the classpath. The example would then look like:

REP=~/.m2/repository
HADOOP_HOME=<PATH TO HADOOP>
HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
LIBS=$REP/commons-logging/commons-logging/1.2/commons-logging-1.2.jar
LIBS=$LIBS:core/target/commons-vfs2-2.1-SNAPSHOT.jar:examples/target/commons-vfs2-examples-2.1-SNAPSHOT.jar:sandbox/target/commons-VFS2-sandbox-2.1.jar
LIBS=$LIBS:$HADOOP_CLASSPATH
java -cp $LIBS org.apache.commons.vfs2.example.Shell


On my local machine, this looks like:

08:03:17 ~/EclipseWorkspace/commons-vfs2-project$ unset REP
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ unset HADOOP_HOME
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ unset LIBS
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ REP=~/.m2/repository
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ HADOOP_HOME=/home/dave/Software/hadoop-2.6.0
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$REP/commons-logging/commons-logging/1.2/commons-logging-1.2.jar
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$LIBS:core/target/commons-vfs2-2.1-SNAPSHOT.jar:examples/target/commons-vfs2-examples-2.1-SNAPSHOT.jar:sandbox/target/commons-VFS2-sandbox-2.1.jar
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$LIBS:$HADOOP_CLASSPATH
08:03:18 ~/EclipseWorkspace/commons-vfs2-project$ java -cp $LIBS org.apache.commons.vfs2.example.Shell
15/01/09 20:03:18 INFO impl.StandardFileSystemManager: Using "/tmp/vfs_cache" as temporary files store.
VFS Shell 2.1-SNAPSHOT
> info
Default manager: "org.apache.commons.vfs2.impl.StandardFileSystemManager" version 2.1-SNAPSHOT
Provider Schemes: [https, res, gz, hdfs, sftp, ftps, ram, http, file, ftp, tmp, bz2]
Virtual Schemes: [zip, war, par, ear, jar, sar, ejb3, tar, tbz2, tgz]


Is this sufficient?

Dave

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
On Sat, 10 Jan 2015 01:04:56 +0000 (UTC),
[hidden email] wrote:

> To answer your question about the HDFS version, I went back and
> looked at the patches I supplied for VFS-530. In the patches that I
> supplied to bump the Hadoop version to 2.4.0 and 2.6.0 the only
> changes were in the pom. This should mean that, for the parts of the
> Hadoop client objects that I am using, they are API compatible.

Well, you never know. But if that's the case, there is also no problem if
we stay at the older version for now.

> For the example, the list of jars depends on the Hadoop version. I
> think one way to handle that is to use the `hadoop classpath` command
> to create the classpath. The example would then look like:

Looks great, thanks.

> VFS Shell 2.1-SNAPSHOT
> > info
> Default manager:
> "org.apache.commons.vfs2.impl.StandardFileSystemManager" version
> 2.1-SNAPSHOT Provider Schemes: [https, res, gz, hdfs, sftp, ftps,
> ram, http, file, ftp, tmp, bz2] Virtual Schemes: [zip, war, par, ear,
> jar, sar, ejb3, tar, tbz2, tgz]
>
>
> Is this sufficient?

Did you try to use ls or another command to actually connect to an
HDFS name node? I think you should be able to provide
"user:password@host" in the URI with the Shell.

I currently have the problem that the HDFS tests fail (under Linux)
with some form of directory lock problem. I am not sure whether this
worked before (as I haven't tested on my Windows machine). I will see if
the version upgrade helps; if so, I will definitely include it in 2.1.

Greetings
Bernd

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

dlmarion
Bernd,

Regarding the Hadoop version for VFS 2.1, why not use the latest for the first release of the HDFS provider? Hadoop 1.1.2 was released in February 2013.

I just built 2.1-SNAPSHOT over the holidays with JDK 6, 7, and 8 on Ubuntu. What type of test errors are you getting? Testing is disabled on Windows unless you decide to pull in the Windows artifacts attached to VFS-530. However, those artifacts are associated with patch 3 and are for Hadoop 2.4.0. Updating to 2.4.0 would also be sufficient, in my opinion.

I started up Hadoop 2.6.0 on my laptop, created a directory and a file, then used the VFS Shell to list and view the contents (remember, the HDFS provider is currently read-only). Here is what I did:

./hadoop fs -ls /

./hadoop fs -mkdir /vfs-test

./hadoop fs -ls /
Found 1 items
drwxr-xr-x - dave supergroup 0 2015-01-09 21:50 /vfs-test

echo "This is a test" > /tmp/test.txt

./hadoop fs -copyFromLocal /tmp/test.txt /vfs-test/text.txt

./hadoop fs -ls -R /
drwxr-xr-x - dave supergroup 0 2015-01-09 21:56 /vfs-test
-rw-r--r-- 3 dave supergroup 15 2015-01-09 21:56 /vfs-test/text.txt

./hadoop fs -cat /vfs-test/text.txt
This is a test

unset REP
unset HADOOP_HOME
unset HADOOP_CLASSPATH
unset LIBS
REP=~/.m2/repository
HADOOP_HOME=/home/dave/Software/hadoop-2.6.0
HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
LIBS=$REP/commons-logging/commons-logging/1.2/commons-logging-1.2.jar
LIBS=$LIBS:core/target/commons-vfs2-2.1-SNAPSHOT.jar:examples/target/commons-vfs2-examples-2.1-SNAPSHOT.jar:sandbox/target/commons-VFS2-sandbox-2.1.jar
LIBS=$LIBS:$HADOOP_CLASSPATH
java -cp $LIBS org.apache.commons.vfs2.example.Shell


10:01:41 ~/EclipseWorkspace/commons-vfs2-project$ unset REP
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ unset HADOOP_HOME
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ unset HADOOP_CLASSPATH
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ unset LIBS
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ REP=~/.m2/repository
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ HADOOP_HOME=/home/dave/Software/hadoop-2.6.0
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$REP/commons-logging/commons-logging/1.2/commons-logging-1.2.jar
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$LIBS:core/target/commons-vfs2-2.1-SNAPSHOT.jar:examples/target/commons-vfs2-examples-2.1-SNAPSHOT.jar:sandbox/target/commons-VFS2-sandbox-2.1.jar
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ LIBS=$LIBS:$HADOOP_CLASSPATH
10:01:43 ~/EclipseWorkspace/commons-vfs2-project$ java -cp $LIBS org.apache.commons.vfs2.example.Shell
15/01/09 22:01:44 INFO impl.StandardFileSystemManager: Using "/tmp/vfs_cache" as temporary files store.
VFS Shell 2.1-SNAPSHOT
> info
Default manager: "org.apache.commons.vfs2.impl.StandardFileSystemManager" version 2.1-SNAPSHOT
Provider Schemes: [https, res, gz, hdfs, sftp, ftps, ram, http, file, ftp, tmp, bz2]
Virtual Schemes: [zip, war, par, ear, jar, sar, ejb3, tar, tbz2, tgz]
> info hdfs
Provider Info for scheme "hdfs":
capabilities: [GET_TYPE, READ_CONTENT, URI, GET_LAST_MODIFIED, ATTRIBUTES, RANDOM_ACCESS_READ, DIRECTORY_READ_CONTENT, LIST_CHILDREN]
> ls hdfs://Dave-laptop:8020/
15/01/09 22:02:06 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
15/01/09 22:02:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Contents of hdfs://dave-laptop:8020/
vfs-test/
> ls hdfs://Dave-laptop:8020/vfs-test/
Contents of hdfs://dave-laptop:8020/vfs-test
text.txt
> cat hdfs://Dave-laptop:8020/vfs-test/text.txt
This is a test

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
Hello,

On Sat, 10 Jan 2015 03:12:19 +0000 (UTC),
[hidden email] wrote:

> Bernd,
>
> Regarding the Hadoop version for VFS 2.1, why not use the latest on
> the first release of the HDFS provider? The Hadoop 1.1.2 release was
> released in Feb 2013.

Yes, you are right. We don't need to care about 2.0, as this is a new
provider. I will make the changes, I just want to fix the current test
failures I see first.


> I just built 2.1-SNAPSHOT over the holidays with JDK 6, 7, and 8 on
> Ubuntu. What type of test errors are you getting? Testing is disabled
> on Windows unless you decide to pull in windows artifacts attached to
> VFS-530. However, those artifacts are associated with patch 3 and are
> for Hadoop 2.4.0. Updating to 2.4.0 would also be sufficient in my
> opinion.

Yes, what I mean is: I typically build under Windows, so I would not
notice if the tests start to fail. However, they seem to pass on the
integration build:

https://continuum-ci.apache.org/continuum/projectView.action?projectId=129&projectGroupId=16

Running org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTest
Starting DataNode 0 with dfs.data.dir: target/build/test/data/dfs/data/data1,target/build/test/data/dfs/data/data2
Cluster is active
Cluster is active
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.821 sec - in org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTest
Running org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase
Starting DataNode 0 with dfs.data.dir: target/build/test2/data/dfs/data/data1,target/build/test2/data/dfs/data/data2
Cluster is active
Cluster is active
Tests run: 76, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 18.853 sec - in org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase

Anyway, on Ubuntu I currently get this exception:

Running org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase
Starting DataNode 0 with dfs.data.dir: target/build/test/data/dfs/data/data1,target/build/test/data/dfs/data/data2
Cluster is active
Cluster is active
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.486 sec <<< FAILURE! - in org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase
junit.framework.TestSuite@56c77035(org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase$HdfsProviderTestSuite)  Time elapsed: 1.479 sec  <<< ERROR!
java.lang.RuntimeException: Error setting up mini cluster
        at org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase$HdfsProviderTestSuite.setUp(HdfsFileProviderTestCase.java:112)
        at org.apache.commons.vfs2.test.AbstractTestSuite$1.protect(AbstractTestSuite.java:148)
        at junit.framework.TestResult.runProtected(TestResult.java:142)
        at org.apache.commons.vfs2.test.AbstractTestSuite.run(AbstractTestSuite.java:154)
        at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:86)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: java.io.IOException: Cannot lock storage target/build/test/data/dfs/name1. The directory is already locked.
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1327)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1345)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1207)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:187)
        at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:268)
        at org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTestCase$HdfsProviderTestSuite.setUp(HdfsFileProviderTestCase.java:107)
        ... 11 more

Running org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTest
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.445 sec - in org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTest

When I delete the core/target/build/test/data/dfs/ directory and then run the ProviderTest, I can do that multiple times and it works:

  mvn surefire:test -Dtest=org.apache.commons.vfs2.provider.hdfs.test.HdfsFileProviderTest

But when I run all tests, or the HdfsFileProviderTestCase, it fails, and afterwards not even the ProviderTest succeeds until I delete that directory.

(I suspect the "locking" message is misleading; it looks more like the data pool has some kind of instance ID which it does not have at the next run.)

It looks like the TestCase has a problem and the ProviderTest does not do proper pre-cleaning. I will check the source. More generally speaking, it should not use a fixed working directory anyway.


> I started up Hadoop 2.6.0 on my laptop, created a directory and file,
> then used the VFS shell to list and view the contents (remember, HDFS
> provider is read-only currently). Here is the what I did:

Looks good. I will shorten it a bit and add it to the wiki. BTW: the warning in the output, is that something we can change?

Regards
Bernd

Re: [VFS] Implementing custom hdfs file system using commons-vfs 2.0

Bernd Eckenfels
Hello,

with these commits I added a cleanup of the data directory before the
MiniDFSCluster is started. I also use absolute file names to make
debugging a bit easier, and I moved the initialisation code to the
setUp() method:

http://svn.apache.org/r1650847 & http://svn.apache.org/r1650852
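
In essence the change does something like this (simplified from memory;
the exact code is in the commits above, and `cluster` is the test
suite's MiniDFSCluster field):

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

@Override
protected void setUp() throws Exception
{
    // wipe whatever a previous run left behind before the name node is formatted
    final File data = new File("target/build/test/data").getAbsoluteFile();
    FileUtils.deleteDirectory(data);

    final Configuration conf = new Configuration();
    cluster = new MiniDFSCluster(conf, 1, true, null);
    cluster.waitActive();
}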

This way the tests do not error out anymore. But I have no idea why this
was happening on one machine and not on others (maybe a race; the
failing machine has an SSD?).

So this means I can now concentrate on merging the new version.

Regards
Bernd

