hadoop ToolRunner (org.apache.hadoop.util.ToolRunner)

When running Hadoop MapReduce jobs, you very often end up writing a Driver that extends Configured and implements Tool. It looks roughly like this.

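     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.conf.Configured;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapred.FileInputFormat;
     import org.apache.hadoop.mapred.FileOutputFormat;
     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.util.Tool;
     import org.apache.hadoop.util.ToolRunner;

     public class MyApp extends Configured implements Tool {

       @Override
       public int run(String[] args) throws Exception {
         // Configuration processed by ToolRunner
         Configuration conf = getConf();

         // Create a JobConf using the processed conf
         JobConf job = new JobConf(conf, MyApp.class);

         // Process custom command-line options. The generic options have
         // already been stripped, so the remaining args start at index 0.
         Path in = new Path(args[0]);
         Path out = new Path(args[1]);

         // Specify various job-specific parameters
         job.setJobName("my-app");
         // JobConf.setInputPath/setOutputPath are gone in later releases;
         // the mapred FileInputFormat/FileOutputFormat helpers replace them.
         FileInputFormat.setInputPaths(job, in);
         FileOutputFormat.setOutputPath(job, out);
         // MyMapper and MyReducer are your own classes (not shown here)
         job.setMapperClass(MyMapper.class);
         job.setReducerClass(MyReducer.class);

         // Submit the job, then poll for progress until the job is complete
         JobClient.runJob(job);
         return 0;
       }

       public static void main(String[] args) throws Exception {
         // Let ToolRunner handle generic command-line options
         int res = ToolRunner.run(new Configuration(), new MyApp(), args);

         System.exit(res);
       }
     }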

You hand ToolRunner the Tool object you wrote plus the arguments, and call run:

ToolRunner.run(new Configuration(), new MyApp(), args);

I looked into what ToolRunner's run method actually does at that call, and it is as follows.

     public static int run(Configuration conf, Tool tool, String[] args)
         throws Exception {
       if (conf == null) {
         conf = new Configuration();
       }
       GenericOptionsParser parser = new GenericOptionsParser(conf, args);

       // set the configuration back, so that Tool can configure itself
       tool.setConf(conf);

       // get the args w/o generic hadoop args
       String[] toolArgs = parser.getRemainingArgs();
       return tool.run(toolArgs);
     }

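In short: the reserved arguments are parsed first by GenericOptionsParser, and then the run method of the Tool object you wrote is executed.

In more detail, at GenericOptionsParser parser = new GenericOptionsParser(conf, args); the GenericOptionsParser picks out the reserved argument options from what came in on the command line, sets the corresponding values in the job's conf, and then returns whatever is left.

The part that returns the rest: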

String[] toolArgs = parser.getRemainingArgs();

This remainder is what gets passed to the run method of the Tool object we implemented (MyApp here), which is then executed.

So if you launch it from the shell like this

hadoop jar myJar.jar com.jinuland.MyApp -files cachefile.txt -libjars mylib inputFile outputFile

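then at run time ToolRunner.run interprets the whole -files cachefile.txt -libjars mylib part and puts it into the job's conf, and passes only the remaining inputFile outputFile as the arguments to the Tool's run method (MyApp.run here; Configured itself has no run method).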
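To make the split concrete, here is a small sketch in plain Java that mimics what happens to that command line. It is not Hadoop's code: the class GenericArgSplitSketch and its methods are made up for illustration, it only knows three of the reserved options, and it assumes each of them takes exactly one value and that parsing stops at the first argument that is not a reserved option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified stand-in for the split that GenericOptionsParser performs.
// NOT Hadoop's implementation; all names here are hypothetical.
final class GenericArgSplitSketch {

    // A subset of the reserved options, each assumed to take one value.
    static final Set<String> RESERVED =
            new HashSet<>(Arrays.asList("-files", "-libjars", "-archives"));

    // Index of the first argument that is not part of an "option value" pair.
    private static int firstAppArg(String[] args) {
        int i = 0;
        while (i + 1 < args.length && RESERVED.contains(args[i])) {
            i += 2; // skip the option and its value
        }
        return i;
    }

    // The reserved options found at the front of the command line, in order.
    static Map<String, String> consumed(String[] args) {
        Map<String, String> out = new LinkedHashMap<>();
        int end = firstAppArg(args);
        for (int i = 0; i < end; i += 2) {
            out.put(args[i], args[i + 1]);
        }
        return out;
    }

    // Everything else, i.e. what the Tool's run(String[]) would receive.
    static List<String> remaining(String[] args) {
        return new ArrayList<>(
                Arrays.asList(args).subList(firstAppArg(args), args.length));
    }

    public static void main(String[] argv) {
        String[] args = {"-files", "cachefile.txt", "-libjars", "mylib",
                         "inputFile", "outputFile"};
        System.out.println(consumed(args));  // {-files=cachefile.txt, -libjars=mylib}
        System.out.println(remaining(args)); // [inputFile, outputFile]
    }
}
```

The point to take away is only the shape of the result: the reserved options end up in one place (the conf, in the real thing) and the application sees a shorter array that starts with its own arguments.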

From there, inside your Tool's run method you can do whatever you like with those arguments.
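One thing worth doing with them first is checking that they are actually there. The sketch below is a stand-in for the top of a run method, written without any Hadoop types so it can be run on its own; the class name RunArgsSketch, the usage text, and the exit code 2 for bad usage are my own assumptions, not anything Hadoop prescribes.

```java
// Hypothetical stand-in for the argument handling at the top of a Tool's
// run(String[]) method. No Hadoop types are used, so it runs standalone.
final class RunArgsSketch {

    static final int EXIT_OK = 0;
    static final int EXIT_USAGE = 2; // assumed convention for bad usage

    static int run(String[] args) {
        // Generic options are already gone, so exactly two args are expected.
        if (args.length != 2) {
            System.err.println("Usage: MyApp [generic options] <input> <output>");
            return EXIT_USAGE;
        }
        String in = args[0];  // first remaining argument
        String out = args[1]; // second remaining argument
        System.out.println("input=" + in + ", output=" + out);
        return EXIT_OK;
    }

    public static void main(String[] argv) {
        int code = run(new String[] {"inputFile", "outputFile"});
        System.out.println("exit code " + code);
    }
}
```

Since the value returned from run is what ToolRunner.run hands back to main, returning a non-zero code here is enough to make the process exit with an error via System.exit(res).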

License: Attribution, NonCommercial, ShareAlike
